Parsing using RE?


Ravi Singh (UCSD)

Hello all

I have a huge string that I need to parse

Key <Delim1> Value <Delim2> Key <Delim1> Value <Delim2> Key <Delim1>
Value <Delim3>

Key <Delim1> Value <Delim2> Key <Delim1> Value <Delim2> Key <Delim1>
Value <Delim3>

repeated a couple hundred thousand times.

<Delim1> separates the Key from the Value within a pair,
<Delim2> separates two different Key,Value pairs, and
<Delim3> separates records.

I need to get the Key Value pairs and populate a table with that
information.

Would .NET regular expressions be worthwhile, and how would I go
about doing it in a clean, optimized fashion?


Thanks

-Ravi Singh
 
Yes, definitely.

You'll need to write your own regex, though.

I'd recommend an app called Expresso, a free regex tester/builder:
http://www.ultrapico.com/Expresso.htm

If all the delimiters are unique, definitely use a regex. Otherwise you'll be
looping with something like "while (str.IndexOf("<Delim") >= 0) { ..." etc.
Using a regular expression to find the matches would be much quicker, and it
returns a collection of matches (and their fields).

if you get stuck, repost.

HTH
sam
 
string input = "Key <Delim1> Value <Delim2> Key <Delim1> Value <Delim2> " +
    "Key <Delim1> Value <Delim3>Key <Delim1> Value <Delim2> Key <Delim1> " +
    "Value <Delim2> Key <Delim1> Value <Delim3>";

Regex delim1 = new Regex("<Delim1>");
Regex delim2 = new Regex("<Delim2>");
Regex delim3 = new Regex("<Delim3>");

string[] rets3 = delim3.Split(input);
string[] rets2 = delim2.Split(String.Concat(rets3));
string[] rets1 = delim1.Split(String.Concat(rets2));

rets2 and rets1 are not what I expect them to be. =( Any ideas?

Thanks
-Ravi.
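
A note on why rets2 and rets1 come out wrong in the snippet above: String.Concat(rets3) glues the records back together with nothing in between, so the last value of one record runs straight into the first key of the next, and the same happens again one level down. One way around that is to split in nested loops instead of concatenating. The following is only a minimal sketch, assuming the three placeholder delimiters from the question and assuming keys and values can be trimmed; the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class NestedSplitSketch
{
    // Placeholder delimiters taken from the question; substitute the real ones.
    private static readonly string[] RecordDelim = { "<Delim3>" };
    private static readonly string[] PairDelim = { "<Delim2>" };
    private static readonly string[] KeyValueDelim = { "<Delim1>" };

    // Returns one list of key/value pairs per record, so record
    // boundaries are kept instead of being concatenated away.
    public static List<List<KeyValuePair<string, string>>> Parse(string input)
    {
        var records = new List<List<KeyValuePair<string, string>>>();

        foreach (string record in input.Split(RecordDelim, StringSplitOptions.RemoveEmptyEntries))
        {
            var pairs = new List<KeyValuePair<string, string>>();

            foreach (string pair in record.Split(PairDelim, StringSplitOptions.RemoveEmptyEntries))
            {
                // A limit of 2 keeps a value that happens to contain <Delim1> in one piece.
                string[] kv = pair.Split(KeyValueDelim, 2, StringSplitOptions.None);
                if (kv.Length == 2)
                {
                    pairs.Add(new KeyValuePair<string, string>(kv[0].Trim(), kv[1].Trim()));
                }
            }

            // Skip fragments (e.g. trailing whitespace) that held no complete pair.
            if (pairs.Count > 0)
            {
                records.Add(pairs);
            }
        }

        return records;
    }
}
```

Each inner list can then be written out as one table row. A list of pairs is used rather than a dictionary so that repeated keys within a record are not silently overwritten.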
 
Could you post the solution so we can see it? It might help someone
else in the same situation some day.
 
Ravi,
In addition to the other comments.

You could use a While loop with Match.NextMatch.

Something like:

string pattern = @"(?<key>\w+)=(?<value>\w+);?";
string input = "a=1;b=2;c=3;d=4;e=5;";

Regex parser = new Regex(pattern, RegexOptions.Compiled);

Match match = parser.Match(input);
while (match.Success)
{
    Debug.WriteLine(match.Groups["key"], "key");
    Debug.WriteLine(match.Groups["value"], "value");
    match = match.NextMatch();
}

Here "=" stands in for Delim1 and ";" for Delim2. Depending on how important
Delim3 is, I would consider using String.Substring to extract the input up to
Delim3 and then running the above code on that piece...

Hope this helps
Jay
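
The Match/NextMatch loop above can also be stretched to cover Delim3 directly, by capturing which delimiter ended each pair. This is a rough sketch rather than a tested solution: the pattern, the group name "end", and the class and method names are all assumptions for illustration, and it assumes every pair is followed by either Delim2 or Delim3.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexPairSketch
{
    // d1 separates key from value, d2 separates pairs, d3 ends a record.
    public static List<List<KeyValuePair<string, string>>> Parse(
        string input, string d1, string d2, string d3)
    {
        // Lazy quantifiers stop at the nearest delimiter; Regex.Escape guards
        // against delimiters that contain regex metacharacters.
        string pattern =
            "(?<key>.+?)" + Regex.Escape(d1) +
            "(?<value>.*?)(?<end>" + Regex.Escape(d2) + "|" + Regex.Escape(d3) + ")";
        var regex = new Regex(pattern, RegexOptions.Singleline);

        var records = new List<List<KeyValuePair<string, string>>>();
        var current = new List<KeyValuePair<string, string>>();

        for (Match m = regex.Match(input); m.Success; m = m.NextMatch())
        {
            current.Add(new KeyValuePair<string, string>(
                m.Groups["key"].Value.Trim(),
                m.Groups["value"].Value.Trim()));

            // A pair terminated by d3 closes the current record.
            if (m.Groups["end"].Value == d3)
            {
                records.Add(current);
                current = new List<KeyValuePair<string, string>>();
            }
        }

        // Keep pairs from a final record that was not closed by d3.
        if (current.Count > 0)
        {
            records.Add(current);
        }

        return records;
    }
}
```

Note that a trailing pair with no delimiter after it is not matched by this pattern, and malformed input (a missing Delim1, say) can make the lazy key group swallow more than intended, so it is worth checking against real data.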
 
string input = "Key <Delim1> Value <Delim2> Key <Delim1> Value <Delim2> " +
    "Key <Delim1> Value <Delim2> Key <Delim1> Value <Delim2> Key <Delim1> " +
    "Value <Delim2> Key <Delim1> Value <Delim3>";

Regex delim1 = new Regex("<Delim1>");
Regex delim2 = new Regex("<Delim2>");
Regex delim3 = new Regex("<Delim3>");

string[] rets3 = delim3.Split(input);
string[] rets2 = delim2.Split(String.Concat(rets3));
string[] rets1 = delim1.Split(String.Concat(rets2));

There it is. I concatenate the pieces; however, a join might be more appropriate.

Thanks
 
Ravi said:
Would the .NET regular expressions be worth while and how would I go
about doing it in a clean optimized fashion.
RegEx? I'd use PERL :)

hjf
 
Here's a little snippet I wrote to do this kind of thing with just 2
delimiters, one to separate the key-value pairs and another to split
apart each actual pair. Since both delimiters are arrays however, you
can specify any number of different delimiters, so in your case you may
have outerDelimiters == { "<Delim2>", "<Delim3>" } ... if I understand
correctly what it is you are after.

Though I haven't tested it, I'm pretty sure the String.Split method
will be much faster than using Regular Expressions; even a simple RE
requires the costly construction of some internal data structures to do
the job, and the RE routines will at least have to do everything that
String.Split() has to do anyway. However if your delimiters are not
predictable recurring strings, RE would be a better way.

The code:

====================================================
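using System;
using System.Collections.Specialized;

public class NameValueCollectionEx : NameValueCollection
{
    // Splits source into pairs on outerDelimiters, then splits each pair
    // into a key and a value on the first innerDelimiter found.
    public void LoadFrom(string source, string[] outerDelimiters,
        string[] innerDelimiters)
    {
        // The bool overload originally used here was due to be obsoleted
        // in .NET 2.0; the StringSplitOptions enum replaces it.
        string[] pairs = source.Split(outerDelimiters,
            StringSplitOptions.RemoveEmptyEntries);

        foreach (string pair in pairs)
        {
            string[] elements = pair.Split(innerDelimiters, 2,
                StringSplitOptions.RemoveEmptyEntries);
            // Skip fragments that do not contain both a key and a value.
            if (elements.Length == 2)
                Add(elements[0], elements[1]);
        }
    }
}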
====================================================

I don't think you can get things a whole lot more optimized than this.
Though if anyone feels inspired to do a performance comparison vs. RE,
I'd be interested in seeing the results.

Joel
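
For anyone who wants to try the performance comparison mentioned above, here is one way it might be set up. It is only a sketch: the synthetic input, the class and method names, and the choice of Stopwatch timing are assumptions, no results are claimed here, and a single timed run like this is a rough indication at best.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class SplitTimingSketch
{
    // Builds a synthetic input of single-pair records using the thread's placeholder delimiters.
    public static string BuildInput(int records)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < records; i++)
        {
            sb.Append("k").Append(i).Append("<Delim1>v").Append(i).Append("<Delim3>");
        }
        return sb.ToString();
    }

    // Counts the non-empty pieces produced by String.Split.
    public static int CountWithStringSplit(string input)
    {
        string[] delims = { "<Delim2>", "<Delim3>" };
        return input.Split(delims, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    // Counts the non-empty pieces produced by Regex.Split.
    public static int CountWithRegexSplit(string input, Regex splitter)
    {
        int count = 0;
        foreach (string piece in splitter.Split(input))
        {
            if (piece.Length > 0) count++;
        }
        return count;
    }

    // Times both approaches once on the same input and prints the elapsed time.
    public static void Compare(int records)
    {
        string input = BuildInput(records);
        var splitter = new Regex("<Delim2>|<Delim3>", RegexOptions.Compiled);

        Stopwatch sw = Stopwatch.StartNew();
        int a = CountWithStringSplit(input);
        sw.Stop();
        Console.WriteLine("String.Split: {0} pieces in {1} ms", a, sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        int b = CountWithRegexSplit(input, splitter);
        sw.Stop();
        Console.WriteLine("Regex.Split:  {0} pieces in {1} ms", b, sw.ElapsedMilliseconds);
    }
}
```

Calling SplitTimingSketch.Compare(200000) would roughly match the "couple hundred thousand" records from the original question. Running each variant several times and discarding the first run would give steadier numbers.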
 