Regular expressions: parsing an "OLEDB like" connection string ...

Martin Robins

I am trying to parse a string that is similar in form to an OLEDB connection string using regular expressions; in principle it is working, but certain character combinations in the string being parsed can completely wreck it.

The string I am trying to parse is as follows:
commandText=insert into [Trace] (Text) values (@message + N': ' + @category);commandType=StoredProcedure; message=@message; category=@category
I am looking to retrieve name/value pairs where the name is the text before the equals sign and the value is everything after it, with each pair being delimited by a semi-colon ...
Name         Value
commandText  "insert into [Trace] (Text) values (@message + N': ' + @category)"
commandType  StoredProcedure
message      @message
category     @category
The regular expression code is as follows:
Regex regex = new Regex(@"(?<name>[^=]*)=(?<value>[^(?:;|$)]*)(?:;|$)", RegexOptions.ExplicitCapture);
foreach (Match match in regex.Matches(initializeData)) {
    string name = match.Groups[@"name"].Value.Trim().ToLower(), value = match.Groups[@"value"].Value.Trim();
    switch (name) {
        ...
    }
}
The first match found reveals a name of "insert into ..." through to "commandType". It then matches on message=@message and category=@category.

If I change the string being parsed to:
commandText=insertTrace;commandType=StoredProcedure; message=@message; category=@category
It is parsed correctly, picking up the 4 name/value pairs.

I know the problem lies in the regular expression I am using; I need to cope with the brackets and the colon.

Can anybody offer me any further help with the expression?

Cheers.
 
Nathan Kovac

Make this easy on yourself and use commandText.Split.
 
Martin Robins

Whilst I agree that I could split the entire string on the semi-colons and then split each result on the equals, this is not the way that I want to go about it. I would prefer to do this using regular expressions because I am trying to understand regular expressions!

It took me about a day to work out the original expression correctly, and I thought I had done well. Then I changed the data being parsed (to include the brackets etc.) and it broke. I keep looking through the syntax and Google examples and I just cannot see why it does not work, and this is why I would like somebody to explain it.

Thanks anyway for your input; the answer you provided was perfectly valid.

Cheers.
 
Nathan Kovac

I am not at all experienced with regular expressions so I would not be of much help. I have only had one project that used them and my brother coded those functions.

 
Guest

Connection strings always separate their pairs with semicolons, so it is
just as easy to split on ; and then split each piece again on = to get the
key/value pairs. You can do this with either string.Split() or Regex.Split(),
depending on your aim (apologies for not reading in detail).

While this has to be refactored eventually, we created a small app to do
something similar. The history is that the developer forgot about one of the
two strings the app needed. If the second string was missing, the following
code was run:

// Requires: using System.Text;
// Assumes every segment has the form key=value (no trailing semicolon).
private string GetMetadataConnString(string connString)
{
    //TODO: Make char array static
    string[] connStringBroken = connString.Split(";".ToCharArray());
    StringBuilder builder = new StringBuilder();

    for (int i = 0; i < connStringBroken.Length; i++)
    {
        //TODO: Make char array static
        // Split the i-th segment into its key and value.
        string[] keyValue = connStringBroken[i].Split("=".ToCharArray());

        builder.Append(keyValue[0]);
        builder.Append("=");

        // Point the database at MetadataDB; copy every other value unchanged.
        if ((keyValue[0].ToUpper() == "DATABASE") ||
            (keyValue[0].ToUpper() == "INITIAL CATALOG"))
            builder.Append("MetadataDB");
        else
            builder.Append(keyValue[1]);

        builder.Append(";");
    }

    return builder.ToString();
}

There are a couple of things that can be done to refactor for performance,
etc., but it is easy to maintain, which is a win.

--
Gregory A. Beamer
MVP; MCP: +I, SE, SD, DBA

***************************
Think Outside the Box!
***************************


 
Hans Kesting

Search for tools like "The Regulator" and "Regex Coach"; they might help you
test your regexes.

Hans Kesting

 
Greg Bacon

: The string I am trying to parse is as follows:
: commandText=insert into [Trace] (Text) values (@message + N': ' +
: @category);commandType=StoredProcedure; message=@message;
: category=@category
: [...]
: The regular expression code is as follows:
: Regex regex = new
: Regex(@"(?<name>[^=]*)=(?<value>[^(?:;|$)]*)(?:;|$)",
: RegexOptions.ExplicitCapture);

Part of your problem is that most metacharacters lose their special
meanings inside character classes. I doubt that you meant to say
that a value is zero or more characters that aren't parentheses,
question mark, colon, semicolon, pipe, and dollar sign.
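
The point about metacharacters inside a character class can be checked one character at a time. The following is only a sketch: the class name CharClassDemo and the helper IsAllowedInValue are made up for illustration, and the class under test is copied from the value part of the original pattern.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharClassDemo
{
    // The value class from the original pattern, anchored so that it must
    // match a whole one-character string.
    static readonly Regex ValueChar = new Regex(@"^[^(?:;|$)]$");

    // Hypothetical helper: true if the class accepts the character.
    public static bool IsAllowedInValue(char c)
    {
        return ValueChar.IsMatch(c.ToString());
    }

    static void Main()
    {
        // The first seven characters are the ones listed inside the class.
        foreach (char c in "(?:;|$)aZ @[")
        {
            Console.WriteLine("'{0}' allowed in value: {1}", c, IsAllowedInValue(c));
        }
    }
}
```

Each of ( ? : ; | $ ) should be reported as not allowed, because inside the brackets they are plain literals rather than grouping, alternation, or anchor syntax.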

The trickier part was figuring out why it matched the first *name*
as "insert into...commandType". At first, I thought it might have
been a longest-leftmost issue[*], but then I realized it was due to
a combination of the character class misunderstanding and your trailing
"anchor."

[*] A POSIX thing -- see pg. 116 of Friedl's *Mastering Regular
Expressions* or http://shurl.org/friedl-longest-leftmost

When the matching engine tries the real first value ("insert...
@category)"), it sees the left parenthesis before Text and
says, 'Wait, a value can't have any parentheses because of the
given character class.'

It then tries to backtrack, but it can't match the trailing anchor,
i.e., there's no semicolon or end-of-line to the left of that
paren.

'Okay,' it thinks, 'I must've matched a bad substring for name,'
but a name is zero or more characters that aren't equals signs.
The next place that can start is "insert into...", and the greedy
star quantifier sucks up everything up to "commandType". The
rest of the pattern can match from there, and that explains the
faulty match.
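
The walk-through above can be reproduced with a few lines. This is a sketch rather than the original poster's program: FaultyMatchDemo and FirstMatch are hypothetical names, and the input is the single-line string from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class FaultyMatchDemo
{
    public const string Input =
        "commandText=insert into [Trace] (Text) values (@message + N': ' + @category);" +
        "commandType=StoredProcedure; message=@message; category=@category";

    // Runs the original pattern and returns only the first match.
    public static Match FirstMatch()
    {
        Regex regex = new Regex(
            @"(?<name>[^=]*)=(?<value>[^(?:;|$)]*)(?:;|$)",
            RegexOptions.ExplicitCapture);
        return regex.Match(Input);
    }

    static void Main()
    {
        Match m = FirstMatch();
        Console.WriteLine("start index = {0}", m.Index);
        Console.WriteLine("name  = [{0}]", m.Groups["name"].Value);
        Console.WriteLine("value = [{0}]", m.Groups["value"].Value);
    }
}
```

The first match should start just after the first equals sign, with a name that runs from "insert into" through "commandType" and a value of StoredProcedure.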

Consider the following snippet:

// Requires: using System; using System.Text.RegularExpressions;
static void Main(string[] args)
{
    // Verbatim string: the line breaks below are part of the input.
    string str =
@"commandText=insert into [Trace] (Text) values (@message + N': ' +
@category);commandType=StoredProcedure ; message=@message;
category=@category";

    Regex nameval = new Regex(
        @"(?<name>\S+)\s*=\s*(?<val>[^;]+?)\s*(;|$)",
        RegexOptions.Singleline);

    foreach (Match m in nameval.Matches(str))
    {
        Console.WriteLine(
            "name=[{0}], val=[{1}]",
            m.Groups["name"].ToString(),
            m.Groups["val"].ToString());
    }
}

Its output is

name=[commandText], val=[insert into [Trace] (Text) values (@message + N': ' +
@category)]
name=[commandType], val=[StoredProcedure]
name=[message], val=[@message]
name=[category], val=[@category]

Here we define a name as a run of non-whitespace characters (\S+). By
matching optional whitespace (\s*) and excluding it from the capturing
parentheses, we save the trim steps from your code.

The val subpattern is similar: a val is a run of non-semicolon
characters. One place to pay attention is the +? quantifier. Remember
that * (zero or more of..) and + (one or more of..) are greedy: they
grab as much text as they can. The ? versions (think of them as
cautious or timid) are very anxious to turn control over to the next
part of the expression.

If the val subpattern had been [^;]+ instead of [^;]+?, any trailing
whitespace would be consumed as part of val, but \s* would still happily
match the empty string. (Remember that starred expressions *always*
succeed, although perhaps by matching nothing.)

This is mostly a polish issue. Using the non-greedy plus gives \s*
a chance to throw away whitespace. Again, this saves the extra trim
steps.
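
The difference between the greedy and the non-greedy val can be seen on a pair with a space before its semicolon. This is a small sketch; LazyValueDemo, FirstValue and the shortened input string are illustrative names and data, not part of the code above.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyValueDemo
{
    public const string Greedy = @"(?<name>\S+)\s*=\s*(?<val>[^;]+)\s*(;|$)";
    public const string Lazy = @"(?<name>\S+)\s*=\s*(?<val>[^;]+?)\s*(;|$)";

    // Returns the val group of the first match of the given pattern.
    public static string FirstValue(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups["val"].Value;
    }

    static void Main()
    {
        string input = "commandType=StoredProcedure ; message=@message";

        // Brackets make any trailing whitespace visible.
        Console.WriteLine("greedy: [{0}]", FirstValue(Greedy, input));
        Console.WriteLine("lazy:   [{0}]", FirstValue(Lazy, input));
    }
}
```

With the greedy class the captured value keeps the space before the semicolon; with the non-greedy one the space is left for \s* to consume.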

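One more important note: the final name-val pair may be terminated by
end-of-string instead of a semicolon. In .NET, $ matches only at the
end of the string (or just before a final newline) unless
RegexOptions.Multiline is set, so the pattern copes with embedded
newlines as long as Multiline is left off. RegexOptions.Singleline
only makes . match newlines; this pattern contains no dots, so the
option is harmless here rather than required. (I wasn't sure if the
newlines in your example were an artifact of posting to Usenet or
whether they might actually be there, so I took the conservative route.)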
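
To check how $ interacts with the regex options, a sketch along these lines may help. As far as I know, in .NET it is RegexOptions.Multiline that lets $ match at the end of every line, while RegexOptions.Singleline only changes what . matches. The two-line input and the names DollarAnchorDemo and FirstValue are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class DollarAnchorDemo
{
    public const string Pattern = @"(?<name>\S+)\s*=\s*(?<val>[^;]+?)\s*(;|$)";

    // Returns the val group of the first match under the given options.
    public static string FirstValue(string input, RegexOptions options)
    {
        return Regex.Match(input, Pattern, options).Groups["val"].Value;
    }

    static void Main()
    {
        // A value that contains an embedded newline.
        string input = "commandText=insert into [Trace]\nvalues (1);commandType=Text";

        RegexOptions[] all = { RegexOptions.None, RegexOptions.Singleline, RegexOptions.Multiline };
        foreach (RegexOptions options in all)
        {
            // Show the newline as \n so each result stays on one line.
            string val = FirstValue(input, options).Replace("\n", "\\n");
            Console.WriteLine("{0,-10} val=[{1}]", options, val);
        }
    }
}
```

None and Singleline should both capture the full two-line value, while Multiline cuts the value off at the first line break because $ is then allowed to match there.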

I hope this helps.

Greg
 
Martin Robins

This is exactly what I was looking for (I think; I am going to have to read
it a couple more times to understand it fully). The string was indeed a
single line, but your use of RegexOptions.Singleline will not hurt anyway, as
the string to be parsed is in an XML configuration file, and the results are
what I expected.

Thanks for your help.

Martin.

 
Greg Bacon

: This is exactly what I was looking for (I think; I am going to have to
: read it a couple more times to understand it fully) - The string was
: indeed a single line but your use of RegexOptions.SingleLine will not
: hurt anyway as the string to be parsed is in an XML configuration file
: and the results are what I expected.
:
: Thanks for your help.

I'm glad to help. Let me know if there are places I can provide
clearer explanations.

Greg
 
Martin Robins

New problem:

If I insert a new parameter before my commandText, it corrupts again!

The new parse string is ...

database=Application;commandText=insert into [Trace] (Text) values(@category
+ N': ' + @message);commandType=StoredProcedure; message=@message;
category=@category

It is all a single line again, but in this instance the first "name" returned
is "database=Application;commandText" with a "value" of "insert into [Trace]
(Text) values(@category + N': ' + @message);". After this initial hiccup,
it catches up and reports the remaining pairs correctly.

Can I be pushy and ask for more help?

Cheers.

Martin.




 
Greg Bacon

: If I insert a new parameter before my commandText, it corrupts again!
:
: new parse string is ...
:
: database=Application;commandText=insert into [Trace] (Text)
: values(@category + N': ' + @message);commandType=
: StoredProcedure; message=@message; category=@category

It needs a one-character change to the pattern:

Regex nameval = new Regex(
    @"(?<name>\S+?)\s*=\s*(?<val>[^;]+?)\s*(;|$)",
    RegexOptions.Singleline);

Look in the name subpattern: from \S+ to \S+? is the change.

The greedy version grabs as many non-whitespace characters as it
can, which, as you saw, erroneously includes the = in the first pair
specification.

An equivalent pattern argument is

@"(?<name>[^=\s]+)\s*=\s*(?<val>[^;]+?)\s*(;|$)"

Being greedy is fine here because the character class won't let it
accidentally pick up an equals sign.
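
The three variants of the name subpattern can be compared side by side. This is a sketch under the assumption that the input is the single-line string from the follow-up question; NamePatternDemo and Names are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NamePatternDemo
{
    public const string Input =
        "database=Application;commandText=insert into [Trace] (Text) values(@category + N': ' + @message);" +
        "commandType=StoredProcedure; message=@message; category=@category";

    // Builds the full pattern around the given name subpattern and
    // returns every captured name in match order.
    public static List<string> Names(string namePattern)
    {
        Regex regex = new Regex(
            @"(?<name>" + namePattern + @")\s*=\s*(?<val>[^;]+?)\s*(;|$)");
        List<string> names = new List<string>();
        foreach (Match m in regex.Matches(Input))
        {
            names.Add(m.Groups["name"].Value);
        }
        return names;
    }

    static void Main()
    {
        // Greedy \S+, non-greedy \S+?, and the negated class.
        foreach (string namePattern in new[] { @"\S+", @"\S+?", @"[^=\s]+" })
        {
            Console.WriteLine("{0,-8} -> {1}", namePattern, string.Join(" | ", Names(namePattern)));
        }
    }
}
```

The greedy \S+ should report "database=Application;commandText" as its first name, while the other two should list database, commandText, commandType, message and category separately.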

Sorry about the mistake.

Greg
 
Martin Robins

I have tried various combinations and both of these do the trick perfectly.
I have chosen to use the one you described as "cautious".

Thanks again; you really have nothing to be sorry about - you have done in
less than an hour what I have spent an eternity on!

Martin.
 
