parsing VB code with a regex

Mark

I must create a routine that finds tokens in small, arbitrary VB code
snippets. For example, it might have to find all occurrences of
{Formula}

I was thinking that using regular expressions might be a neat way to solve
this, but I am new to them. Can anyone give me a hint here?

The catch is, it must only find tokens that are not quoted and not
commented; examples follow

(1) should find {Area} (both occurrences) and {Height} in this string
If {Area} > 100 Then Return {Area} Else Return {Height}

(2) should find {Area}, but not {AreaString} in this string
If {Area} = "{AreaString}" 100 Then Return "Found it!"

(3) should find {Height}, but not {Area} in this multi-line string
'the {Area} token is not used here
If {Height} > 1000 Then
Return "Tall"
Else
Return "Short"
End If

I've searched many web sites and libraries, but they all seem to be
interested in finding quoted strings, not avoiding them. I'd appreciate any
help.

Emby
 
Jon Shemitz

Mark said:
I must create a routine that finds tokens in small, arbitrary VB code
snippets. For example, it might have to find all occurrences of
{Formula}
The catch is, it must only find tokens that are not quoted and not
commented; examples follow

Iow, you want to match all {tokens} except those 1) between double
quotes and 2) between a single quote and a line end. Right?

Mark said:
I've searched many web sites and libraries, but they all seem to be
interested in finding quoted strings, not avoiding them.

Yeah, I'm not surprised. This is not trivial. Both these regexes use
the same 3 options - IgnoreCase | Multiline |
IgnorePatternWhitespace - and definitely NOT Singleline.

First pass, ignores comments:

(?<! ' .* ) # Can't be to right of ' char
\{ (?<token> [a-z_0-9]\w+ ) \} # Capture a {token}

Because Singleline is NOT set, the (?<! "negative lookbehind
assertion" rules out anything to the right of a ' character.

Second pass, almost there:

(?<! ' .* ) # Can't be to right of ' char
(?<! ^ .* " ) # Can't be to right of " char

\{ (?<token>[a-z_0-9]\w+ ) \} # Capture a {token}

(?! " .* $ ) # Can't be to left of " char

... works on your examples and on

If {Area} = "{AreaString}" 100 Then Return "Found it!" {foo}

but (alas!) it also matches the {foo} in

If {Area} = "{AreaString}" 100 Then Return Found it!" {foo}

I have to go to bed early to volunteer at the polls all day
tomorrow. If you can't take it from here, I'll try to take it
farther on Wednesday. (Or, I'll try to remember to - feel free
to send me mail Tues night or Wedn morning.)


--

.NET 2.0 for Delphi Programmers <http://www.midnightbeach.com/.net>

Delphi skills make .NET easy to learn
Just printed, and shipping now.
 
Emby

Hi Jon,

Thanks for the response. I'm very new to RE, but my hope that I can achieve this task with an RE is waning. I took what you provided (thanks!), and enhanced it a bit to allow for a bit more freedom in token chars (must start with alpha, but allow embedded spaces, dots, hyphen and underscore, and also disallow a space just before the closing brace):

(?<! ' .* ) # Can't be to right of ' char
(?<! ^ .* " ) # Can't be to right of " char
\{(?<token>[a-z]([a-z_0-9 \-.])*)[^ ]\} # Capture a {token}
(?! " .* $ ) # Can't be to left of " char

In this example text:

If {Area} > 100 Then
Return "Hello" & {Area String} & {A}
Else
Return {Height}
End If
If {_Area} = "{Area String}" * 100 Then Return "Found it!"
'the {Area} token is not used here
If {Height} > 1000 Then
Return "hello "" y" & "Width" & "Tall" & {breadth}
Else
Return "Yes" & {Short_One - Two.Three} & "Short"
End If
If {Width} = " {Width} " Then Return "Dang!"
Return "Hello ' Yes" & {Last One}

Found tokens bolded; in case you can't see bolding, tokens are:
Line  Token
1     {Area}
2     {Area String}
4     {Height}
6     {Area String}            (this one surprised me)
8     {Height}
9     {breadth}
11    {Short_One - Two.Three}
13    {Width}                  (both instances)

Newbie question: why is {A} on line 2 not captured? I would think that
[a-z]([a-z_0-9 \-.])*
says: 1 alpha char followed by zero or more [a-z_0-9 \-.]

The above is good, except that in a perfect world, I would not want the second {Width} found on line 13 because it is within a quoted string, and I would want {Last One} on line 14 found, because the comment character is within a quoted string.

And lastly, although I did not originally pose it this way (RE newbie that I am), I will actually know the set of all defined tokens in advance. I just have to find all instances of a set of defined tokens that are not commented or within quoted strings. The funny thing is that, even though one would think that would make the problem simpler, I can't see how to do that. If I use something like:
\{(?<token>height)\} # Capture a {token}
to find the {Height} token, that works; but
\{(?<token>Area String)\} # Capture a {token}
fails to capture {Area String}
but \{(?<token>Area[ ]String)\} # Capture a {token} works!

What am I missing here? I can see it's the space, but I don't understand why.
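
A minimal sketch of what seems to be happening in both puzzles above. It uses Python's `re` module as a stand-in for the .NET Regex class (an assumption made for illustration; the thread itself is about .NET). `re.VERBOSE` plays the role of IgnorePatternWhitespace, and Python spells named groups `(?P<token>...)` rather than `(?<token>...)`. With that option on, a space in the pattern outside a character class is dropped, so `Area String` is really `AreaString`, while a space inside `[ ]` survives. Separately, the trailing `[^ ]` consumes one character of its own, so a token needs at least two characters and the last one falls outside the named group. The `revised` pattern is a hypothetical alternative, not something proposed in the thread.

```python
import re

# re.VERBOSE ignores unescaped whitespace in the pattern, except inside [...]
FLAGS = re.IGNORECASE | re.VERBOSE

text = 'Return "Hello" & {Area String} & {A} & {AreaString}'

# The space is outside a character class, so this is really \{AreaString\}.
plain = re.compile(r"\{(?P<token>Area String)\}", FLAGS)

# A space inside a character class is kept in verbose mode.
bracketed = re.compile(r"\{(?P<token>Area[ ]String)\}", FLAGS)

# Pattern from the post: [^ ] must match one extra character before the brace.
original = re.compile(r"\{(?P<token>[a-z][a-z_0-9 \-.]*)[^ ]\}", FLAGS)

# Hypothetical revision: a zero-width lookbehind checks "no space before }"
# without consuming a character.
revised = re.compile(r"\{(?P<token>[a-z][a-z_0-9 \-.]*)(?<![ ])\}", FLAGS)


def tokens(pattern, s):
    """Return the text captured by the named group for every match."""
    return [m.group("token") for m in pattern.finditer(s)]


for name, pattern in [("plain", plain), ("bracketed", bracketed),
                      ("original", original), ("revised", revised)]:
    print(name, tokens(pattern, text))
```

The `original` pattern still matches `{Area String}` as a whole, which is why the bolded output looked right, but its named group stops one character short.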

Thanks again for your help.


Kevin Spencer

You need to define your rules more exactly. Examples are not specific.

Mark said:
I must create a routine that finds tokens in small, arbitrary VB code
snippets. For example, it might have to find all occurrences of
{Formula}

This leaves a lot of room for interpretation, something which computers are
extremely poor at, and humans not much better. The word "token" simply means
a series of characters without spaces between them. "{Formula}" as an
example, without any rules, implies nothing. I cannot, for example, assume
that by this example, your tokens will or must always have curly brackets
around them. It does not necessarily imply whether or not spaces may appear
between the curly brackets (if required) and the characters inside them.

Your example shows:
(1) should find {Area} (both occurrences) and {Height} in this string
If {Area} > 100 Then Return {Area} Else Return {Height}

Again, are the characters always supposed to be surrounded by curly
brackets? Should they have curly brackets at all, or are you just using them
to "highlight" what you are talking about?

Mark said:
(2) should find {Area}, but not {AreaString} in this string
If {Area} = "{AreaString}" 100 Then Return "Found it!"

What should it match in the following example?
If {Area} = "{Area String}" 100 Then Return "Found it!"

How about this one?

If {Area} = {"Area" "String") Then Return "Found it!"

Mark said:
(3) should find {Height}, but not {Area} in this multi-line string
'the {Area} token is not used here
If {Height} > 1000 Then
Return "Tall"
Else
Return "Short"
End If

Does this mean that it should ignore commented lines?

The first step to writing a regular expression is to define the rules that
comprise the pattern to match. If you can define these rules without any
examples (that is, if the rules are exactly defined, no examples will be
needed), I can write you a regular expression.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Development Numbskull

Nyuck nyuck nyuck
 
Emby

Hi Kevin,

You are indeed correct. The original question was a more general, "how can this be done with RE's?"

To be specific, I will have a set of strings, which I will call tokens, each of which will consist of upper case alpha-numeric characters in curly brackets. I also have another set of strings which is the translated value of these tokens. So if I have 5 tokens, I will have 5 translated values, one for each token.

I will also have a code snippet - a potentially multi-line string - which will contain embedded tokens. My task is to replace the tokens in the code snippet string with their translated values. The snippet is a VB code string. But:

1) any token in a code line to the right of a single quote character which is not itself in a quoted string should not be replaced
2) any token that is within a quoted string should not be translated

Sorry, but I'm giving examples coz I'm not sure I've described it well or completely :)
Known Tokens    Translation
{AREA}          7075
{HEIGHT}        2512
{WIDTH}         75
{FOO}           "Yes"

Snippet                        Translated code
If {AREA}>1000 Then            If 7075>1000 Then
Return "Large"                 Return "Large"
' {AREA} token not used        ' {AREA} token not used
ElseIf {HEIGHT}>2500 Then      ElseIf 2512>2500 Then
Return "Tall"                  Return "Tall"
ElseIf {WIDTH} >50 Then        ElseIf 75 >50 Then
Return " ' " & {FOO}           Return " ' " & "Yes"
Else                           Else
Return " is {FOO} !"           Return " is {FOO} !"
End If                         End If


Our system compiles the resulting snippet on the fly and executes it to provide the app with a scripting capability.

Thanks for any help you can extend.
 
Kevin Spencer

Hi Emby,

Thank you, that was very helpful.

We know that the rules of VB dictate that a comment must be on a single
line, and that it is identified by a single quote that is not surrounded by
double-quotes. That is, if you wish to comment across multiple lines, you
must put a comment marker on each line, essentially creating a separate
comment on each line. Any characters to the right of the comment are
commented out of the code. We also know that the comment may appear at any
point in the line, not necessarily at the beginning.

So, now that we're down to a single string, and a few simple rules:

1. "token" is defined as any character data enclosed by curly brackets.
2. Any token inside a VB comment should be ignored.
3. Any token inside a matching pair of double-quotes should be ignored.
4. All other tokens should be matched.

And I came up with this:

(?m)(?<=^[^']*)'[^\n]+$|(?:"[^"]*"|({[^}]+}))

Let me explain a bit. This regular expression takes advantage of a
characteristic of regular expressions: Regular expressions consume a string
as they are parsed. That is, they move through a string in basically a
"forward-only" manner (other than "backtracking," where the engine steps
back to retry after a partial match fails). So, once a portion of a string
has been consumed by one match, it is not available for further matching.

So, I worked backwards from matching tokens to the 2 exceptions where they
should *not* be matched. The token-matching regular expression is simple:

{[^}]+}

Translated, this says a match is a '{' character, followed by any number of
characters that are *not* '}' followed by a '}' character. Simple enough. It
matches every token in the string. Now we want to weed out the
non-qualifying tokens. Since the comment is the one that always weeds
everything out, I left that for last (first). You'll see why in a minute.

The rule for quoted tokens is expressed as follows:

"[^"]*"

It is similar to the first: a double-quote, followed by any number of
characters that are *not* a double-quote, followed by a double-quote.

Now, how do we get these 2 working together? We use the OR operator - '|'.
When we OR these together, we get this:

"[^"]*"|{[^}]+}

This seems to expand the number of matches, since matches are now *added*
that include non-tokens. Here's where the "consuming" aspect comes in. The
matches that match the first rule include matches of tokens inside the
double-quote pairs. So, the only real problem here is separating the 2
groups. So, we use a group (of course!).

"[^"]*"|({[^}]+})

At this point, all tokens are matched, including those inside double-quote
pairs. The only ones that we want are the ones inside "group 1" (the only
capturing group in the regular expression). So, by using that group, we
eliminate the matches inside the double-quote pairs.
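
A small sketch of this intermediate step, written with Python's `re` module as a stand-in for the .NET Regex class (an assumption for illustration only). The pattern is the one above with the braces escaped. Every quoted string and every unquoted token produces a match, but only unquoted tokens fill group 1.

```python
import re

# Quoted string first, then a captured token: "[^"]*"|({[^}]+})
pattern = re.compile(r'"[^"]*"|(\{[^}]+\})')

line = 'If {Area} = "{AreaString}" Then Return "Found it!" & {Height}'

for m in pattern.finditer(line):
    # group(1) is None when the quoted-string alternative matched.
    kind = "token" if m.group(1) is not None else "quoted string, ignored"
    print(m.group(0), "->", kind)

# Keep only the matches where group 1 took part.
unquoted = [m.group(1) for m in pattern.finditer(line) if m.group(1) is not None]
print(unquoted)
```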

We have one last hurdle now. We want to eliminate anything inside a comment.
I left this for last because the comment eliminates *everything* inside it,
including the double-quote pairs, and thus consumes the most of the 3 rules.
This will make the regular expression more efficient, as it has less work to
do with each match.

The rule for comments, again, is a bit more complicated:

(?m)(?<=^[^']*)'[^\n]+$

First, it must limit a comment to a single line. This is done with the '^'
(start of string/line) and '$' (end of string/line) characters. I also used
the "(?m)" directive, which indicates that the '^' and '$' characters match
at new lines.

So, it begins with a positive look-behind: (?<=^[^']*) which means "the
following is *only* a match if preceded by this regular expression". The
look-behind contains the start-of-line anchor and a character group which
matches 0 or more non-single-quotes. The condition applies to the rest of
the regular expression (without the condition matching - lookarounds do not
consume) - a single-quote, followed by 1 or more non-line-break characters,
followed by the end of the line or the end of the string.

This covers comments which begin in the middle of a line as well as at the
beginning. The lookbehind prevents the characters preceding the single-quote
from being consumed, thereby making them available for the other 2
conditions. I finished up by grouping the second 2 regular expressions
into a single non-capturing group - (?:"[^"]*"|({[^}]+})), making them a
single alternative to the first, and ORing them all together.

In essence, it says, "Match the first (comment) group first. With what is
left over, match either the quoted strings, or the left-over tokens, and put
the left-over tokens into a group." You can do a regular expression match,
and use the values in Group 1 to do your replacements.
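
A rough sketch of the whole replacement step, assuming Python's `re` module in place of the .NET Regex class (an assumption; the names `PATTERN`, `TRANSLATIONS` and `translate` are made up for this example). Python's `re` does not accept a variable-length look-behind, so the comment alternative here is simply a single quote followed by the rest of the line. That appears to be enough for these examples, because a quoted string that contains a single quote starts earlier in the line and is consumed before the scan reaches the quote inside it.

```python
import re

# comment | quoted string | (token); only an unquoted, uncommented token
# ends up in group 1.
PATTERN = re.compile(r"""'[^\n]*|"[^"]*"|(\{[^}]+\})""")

# Token table taken from the example earlier in the thread.
TRANSLATIONS = {
    "{AREA}": "7075",
    "{HEIGHT}": "2512",
    "{WIDTH}": "75",
    "{FOO}": '"Yes"',
}


def translate(snippet, translations=TRANSLATIONS):
    """Replace tokens that are outside VB comments and quoted strings."""
    def substitute(m):
        token = m.group(1)
        if token is None:
            # A comment or quoted string matched: put it back unchanged.
            return m.group(0)
        # Unknown tokens are left as they are.
        return translations.get(token, token)
    return PATTERN.sub(substitute, snippet)


SNIPPET = "\n".join([
    'If {AREA}>1000 Then',
    'Return "Large"',
    "' {AREA} token not used",
    'ElseIf {HEIGHT}>2500 Then',
    'Return "Tall"',
    'ElseIf {WIDTH} >50 Then',
    'Return " \' " & {FOO}',
    'Else',
    'Return " is {FOO} !"',
    'End If',
])

print(translate(SNIPPET))
```

Doing the substitution in a callback keeps the comments and strings in place while only the group 1 matches are swapped for their translated values.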

I tested it fairly thoroughly. Let me know if it works for you.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

A lifetime is made up of
Lots of short moments.




Jon Shemitz

Kevin said:
In essence, it says, "Match the first (comment) group first. With what is
left over, match either the quoted strings, or the left-over tokens, and put
the left-over tokens into a group." You can do a regular expression match,
and use the values in Group 1 to do your replacements.

Very clever - I'll have to remember this approach. It's much simpler
to exclude by consuming than to fart around with zero-width assertions
the way I did.

Emby, do note that each Group has a Success property, just like the
Match (a Group descendant, fwiw) does.

 
Kevin Spencer

Why, thank you Jon!

I had to give it some thought (more than I should have, I'm sure my boss
would agree!). Originally, I thought your solution was more along the lines
of what would work, but I have been studying Regular Expressions quite a bit
over the past six months, and somewhere along the way I discovered the
"consumption factor." I'm sure I must have seen someone else do it
originally. Just passing it along. But that's what we're all here for!

 
Emby

Thank you gentlemen

I have more learning to do before I really understand what's going on here.
I mean, I can see how each part of the RE works, your collective
explanations of each part were excellent. But I am not yet conversant with
some of the terms; I am a bit confused by the names of the .NET classes Match,
Group and Capture.

I mean, you allocate a RegEx with its RE and options, and you can ask the
RegEx for its "Matches", each of which has a Success property. Um, if it
wasn't "successful", then why is it a match ??

OK - each "Match" has a string Value, which appears to be the substring of
the data that the Match matched. And it also has a collection of Captures,
each of which has an Index and a length. Not really sure what a Capture is
...

A Match also has a collection of Groups - each group also has a Success
flag, a Value, an Index and a length.

But wait - a Group also has a Captures collection ...

So I'm a bit confused ... I've read through the docs, but it's like walking
in circles because they all seem to be defined in terms of the others.

I'm sure I'll do fine once I get the basic terminology and concepts. If you
know of a decent primer, that might help.

Again, thanks for your help and tutelage

Cheers,

Emby
 
Kevin Spencer

Hi Emby,

It's not really as complicated as it seems. Some of your confusion arises
from the fact that Match inherits Group, and therefore has a few members
that are basically irrelevant, like Success.

From the beginning: A Regex class defines a pattern-matching string that is
used against a target string. Notice the word "matching" in there. Any
string which "matches" against the pattern is a match. There can be anywhere
from 0 on up of these Matches. The Match class is a match against the entire
Regular Expression. For example:

"This is a target string"

Simple Regular Expression: "\w+" - Matches 1 or more (consecutive) "word"
characters (letters, digits and the underscore).

This yields 5 Matches: "This" "is" "a" "target" and "string"

Each Match matches a portion of the target string against the entire Regular
Expression.

The following target string would yield 0 Matches: "...................
...................."
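
The same two cases as a quick sketch in Python, used here as a stand-in for the .NET Regex class (an assumption for illustration). `re.findall` returns the matched substrings, roughly what you would get by reading Value from each Match in a MatchCollection.

```python
import re

# One match per run of word characters.
words = re.findall(r"\w+", "This is a target string")
print(words)

# A target with no word characters gives an empty result, not an error.
nothing = re.findall(r"\w+", "." * 39)
print(nothing)
```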

Grouping is a way of subdividing the Matches into "Groups" of sub-matches.
When you create Groups, there will always be Groups in the result, one per
Group defined in the Regular Expression, much as there is always a
MatchCollection, though it may be empty. For example:

This is a Regular Expression for finding Hyperlink HREFs in HTML:

(?i)<(a)[^>]*href=[^>]*[>]?(.*?)(?:</\1|/)>?

When used against the following target string (from the Google Home Page),
it yields 2 matches, because there are 2 HREFs in the text:

<a href=/services/>Business Solutions</a> - <a href=/intl/en/about.html>
About Google</a><span id=hp style="behavior:url(#default#homepage)"></span>

These are the 2 matches:

1. Matches[0].Value = <a href=/services/>Business Solutions</a>

2. Matches[1].Value = <a href=/intl/en/about.html>
About Google</a>

However, you may not want the entire link, but only the InnerHtml of the
link (the part between the tags). So, the Regular Expression has 2 Groups
defined, by enclosing portions of it in parentheses:

(a) matches the "anchor" (a) in the HREF
(.*?) matches everything between the tags.

Note: the other parenthetical expression - (?:</\1|/) is not a capturing
Group (doesn't appear in the Groups Collection). Non-capturing Groups are
often used to apply a set of rules only to a portion of the Regular
Expression. Non-capturing Groups begin with (?:

Now, when we apply these Groups to the Matches, we get the following Values
in the Matches (Groups[0] is always the entire Match, so the numbered Groups
start at index 1):

1. Group 1 - Groups[1].Value = a
Group 2 - Groups[2].Value = Business Solutions

2. Group 1 - Groups[1].Value = a
Group 2 - Groups[2].Value = About Google

This should simply illustrate how a Match can contain multiple Groups. Any
of them can have a Success of false, if that Group took no part in the
Match. In a case like yours, for example, where you're ORing patterns
together, you might have put each of the ORed patterns into a Group. In that
case, let's say that each
Match only matched one Group. The other Groups would be empty.
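
A short sketch of that last case, assuming Python's `re` module as a stand-in for the .NET classes (an assumption; in Python a group that took no part in the match is reported as `None`, where .NET reports Success as false). Each ORed alternative gets its own group, and for any one match only one of them is filled.

```python
import re

# One capturing group per alternative: group 1 = quoted string, group 2 = token.
pattern = re.compile(r'("[^"]*")|(\{[^}]+\})')

line = 'Return "Tall" & {breadth}'

rows = []
for m in pattern.finditer(line):
    # group(0) is the entire match; groups 1 and 2 are the numbered groups.
    rows.append((m.group(0), m.group(1), m.group(2)))

for whole, quoted, token in rows:
    print(whole, "| quoted:", quoted, "| token:", token)

# A group that takes part in the match but matches nothing still counts:
# it yields an empty string rather than None.
empty = re.match(r"(x*)y", "y")
print(repr(empty.group(1)))
```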

 
Emby

Hi Kevin,

Thanks so much for taking the time to explain the basics; I really
appreciate your effort.

I'm sure it doesn't seem it, but I'm a 15 year veteran software engineer
(back then we were just called programmers), with deep experience in C, VB,
and now C# and VB.NET. But I'm an absolute newbie to regular expressions
(obviously). They are very interesting and seem powerful for solving text
parsing and manipulation problems.

Thanks again for your patience and clarity. Same to you Jon. You guys rock!

Cheers,

Emby
"My other computer is an IBM 360 series ... "


Jon Shemitz

Emby said:
I mean, you allocate a RegEx with its RE and options, and you can ask the
RegEx for its "Matches", each of which has a Success property. Um, if it
wasn't "successful", then why is it a match ??

When you get a MatchCollection, each Match was successful. But you can
just call Match to get the first Match; when the Regex doesn't match
at all, the Match.Success property is false. Similarly, you can always
ask a Match instance for the NextMatch. That new Match may or may not
be valid. (If a string contains two matches, the first Match will be a
Success. That Match's NextMatch will also be a Success. But that
second Match's NextMatch will NOT be a Success.)

Emby said:
I'm sure I'll do fine once I get the basic terminology and concepts. If you
know of a decent primer, that might help.

Try my Strings and Files chapter at
<http://www.midnightbeach.com/.net/PDFs.html>.

 
Kevin Spencer

My pleasure Emby, and I made no inferences as to your prior experience with
regards to your inexperience with Regular Expressions. Regular Expressions
is a (highly-specialized) language unto itself, and is a relative newcomer
to the field. It is only in the past couple of years that I have come to
learn and understand them myself! I'm just passing on the benefit that I
have received from others and my own study. Your quick understanding of my
technical explanations is evidence of your own experience!

BTW, C was also my first language, and my progression is similar to your
own!


Emby said:
Hi Kevin,

Thanks so much for taking the time to explain the basics; I really
appreciate your effort.

I'm sure it doesn't seem it, but I'm a 15 year veteran software engineer
(back then we were just called programmers), with deep experience in C,
VB, and now C# and VB.NET. But I'm a absolute newbie to regular
expressions (obviously). They are very interesting and seem powerful for
solving text parsing and manipulation problems.

Thanks again for your patience and clarity. Same to you Jon. You guys
rock!

Cheers,

Emby
"My other computer is an IBM 360 series ... "

Kevin Spencer said:
Hi Emby,

It's not really as complicated as it seems. Some of your confusion arises
from the fact that Match inherits Group, and therefore has a few members
that are basically irrelevant, like Success.

From the beginning: A Regex class defines a pattern-matching string that
is used against a target string. Notice the word "matching" in there. Any
string which "matches" against the pattern is a match. There can be
anywhere from 0 on up of these Matches. The Match class is a match
against the entire Regular Expression. For example:

"This is a target string"

Simple Regular Expression: "\w+" - Matches 1 or more (consecutive) "word"
characters (digits and/or alpha characters).

This yields 5 Matches: "This" "is" "a" "target" and "string"

Each Match matches a portion of the target string against the entire
Regular Expression.

The following target string would yield 0 Matches: "...................
..................."

Grouping is a way of subdividing the Matches into "Groups" of
sub-matches. When you create Groups, there will always be Groups in the
result, one per Group defined in the Regular Expression, much as there is
always a MatchCollection, though it may be empty. For example:

This is a Regular Expression for finding Hyperlink HREFs in HTML:

(?i)<(a)[^>]*href=[^>]*[>]?(.*?)(?:</\1|/)>?

When used against the following target string (from the Google Home
Page), it yields 2 matches, because there are 2 HREFs in the text:

<a href=/services/>Business Solutions</a> - <a href=/intl/en/about.html>
About Google</a><span id=hp
style="behavior:url(#default#homepage)"></span>

These are the 2 matches:

1. Matches[0].Value = <a href=/services/>Business Solutions</a>

2. Matches[1].Value = <a href=/intl/en/about.html>
About Google</a>

However, you may not want the entire link, but only the InnerHtml of the
link (the part between the tags). So, the Regular Expression has 2 Groups
defined, by enclosing portions of it in parentheses:

(a) matches the "anchor" (a) in the HREF
(.*?) matches everything between the tags.

Note: the other parenthetical expression - (?:</\1|/) is not a capturing
Group (doesn't appear in the Groups Collection). Non-capturing Groups are
often used to apply a set of rules only to a portion of the Regular
Expression. Non-capturing Groups begin with (?:

Now, when we read these Groups from each Match, we get the following
Values. Groups[0] always holds the entire Match, so the capturing Groups
are numbered from 1:

1. Group 1 - Groups[1].Value = a
   Group 2 - Groups[2].Value = Business Solutions

2. Group 1 - Groups[1].Value = a
   Group 2 - Groups[2].Value = About Google
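The HREF example can be tried out with a short sketch like this one. It is written in Python, whose re module numbers groups the same way .NET does: group 0 is the entire match and the capturing groups start at 1. Two assumptions: the target string is treated as a single line (the break after about.html> is taken to be wrapping added when the message was posted), and the names href_pattern, target and results are only for the illustration.

```python
import re

# The pattern from the post; (?i) makes it case-insensitive.
href_pattern = re.compile(r"(?i)<(a)[^>]*href=[^>]*[>]?(.*?)(?:</\1|/)>?")

# Assumption: the target is one line, with no break inside the second link.
target = ('<a href=/services/>Business Solutions</a> - '
          '<a href=/intl/en/about.html>About Google</a><span id=hp '
          'style="behavior:url(#default#homepage)"></span>')

results = []
for m in href_pattern.finditer(target):
    # group(0) is the whole match; group(1) and group(2) are the two
    # capturing groups. The (?: ... ) group is not numbered at all.
    results.append((m.group(0), m.group(1), m.group(2)))
    print(m.group(1), "|", m.group(2))

# prints:
# a | Business Solutions
# a | About Google
```

If the line break is kept inside the second link, the dot in (.*?) will not cross it unless the Singleline option (re.DOTALL in Python) is set, so the second match comes out differently.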

This should simply illustrate how a Match can contain multiple Groups.
Any of them can have a Success of false, if that Group took no part in
the Match. In a case like yours, for example, where you're ORing patterns
together, you might put each of the ORed patterns into a Group. In that
case, let's say that each Match only matched one Group. The other Groups
would have a Success of false and an empty Value.
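Here is a rough sketch of ORed patterns where only one alternative carries a Group, applied to the {token} problem from the start of the thread. The pattern is a hypothetical one written for this illustration, not the exact expression posted earlier, and find_tokens and token_pattern are made-up names. It is in Python, where a group that took no part in the match comes back as None (the counterpart of Success being false in .NET).

```python
import re

# Hypothetical pattern: comments and quoted strings are matched first so
# they get consumed; only the third alternative captures anything.
token_pattern = re.compile(r"""
      '[^\n]*               # a VB comment, up to the end of the line
    | "[^"]*"               # a double-quoted string
    | \{(?P<token>\w+)\}    # a {token}; its name lands in the group
""", re.VERBOSE)

def find_tokens(code):
    """Return the names of {tokens} outside comments and quoted strings."""
    return [m.group("token")
            for m in token_pattern.finditer(code)
            if m.group("token") is not None]  # None: group took no part

print(find_tokens('If {Area} = "{AreaString}" 100 Then Return "Found it!"'))
# ['Area']
```

Every comment and every quoted string still produces a match; those matches are simply skipped because their token group is empty. That is the whole trick: the unwanted text is consumed by an alternative that captures nothing.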

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

A lifetime is made up of
Lots of short moments.

Emby said:
Thank you gentlemen

I have more learning to do before I really understand what's going on
here. I mean, I can see how each part of the RE works; your collective
explanations of each part were excellent. But I am not yet conversant
with some of the terms; I am a bit confused by the names of the .NET
classes Match, Group and Capture.

I mean, you allocate a RegEx with its RE and options, and you can ask
the RegEx for its "Matches", each of which has a Success property. Um,
if it wasn't "successful", then why is it a match ??

OK - each "Match" has a string Value, which appears to be the substring
of the data that the Match matched. And it also has a collection of
Captures, each of which has an Index and a length. Not really sure what
a Capture is ...

A Match also has a collection of Groups - each group also has a Success
flag, a Value, an Index and a length.

But wait - a Group also has a Captures collection ...

So I'm a bit confused ... I've read through the docs, but it's like
walking in circles because they all seem to be defined in terms of the
others.

I'm sure I'll do fine once I get the basic terminology and concepts. If
you know of a decent primer, that might help.

Again, thanks for your help and tutelage

Cheers,

Emby

Kevin Spencer wrote:

In essence, it says, "Match the first (comment) group first. With what
is left over, match either the quoted strings, or the left-over tokens,
and put the left-over tokens into a group." You can do a regular
expression match, and use the values in Group 1 to do your replacements.

Very clever - I'll have to remember this approach. It's much simpler
to exclude by consuming than to fart around with zero-width assertions
the way I did.

Emby, do note that each Group has a Success property, just like the
Match (a Group descendant, fwiw) does.

--

.NET 2.0 for Delphi Programmers <http://www.midnightbeach.com/.net>

Delphi skills make .NET easy to learn
Just printed, and shipping now.
 
K

Kevin Spencer

Hi Jon,

I must say that your reference on Regular Expressions in .Net is one of the
best, most well-organized, and most comprehensive that I have seen. I would
recommend it to anyone, and I hope your book does well!

BTW, I usually use the RegexBuddy application (http://www.regexbuddy.com/)
to build and test my regular expressions. Have you used it before? It is not
free, but is only about 30 dollars to buy, and IMHO, well worth the
investment. I have also used it to train myself with regular expressions, as
it has some great tools for building and testing. In addition, it has a very
nice reference manual, and is compatible with almost every "flavor" of
regular expressions, including .Net. You can also create your own library
with it.

 
J

Jon Shemitz

Kevin said:
I must say that your reference on Regular Expressions in .Net is one of the
best, most well-organized, and most comprehensive that I have seen. I would
recommend it to anyone, and I hope your book does well!

Thank you. The book's been out a full week now: it had a nice surge on
the announcement that it was finally available; now it has to build
some word of mouth as people read it and start commenting on it.

BTW, I usually use the RegexBuddy application (http://www.regexbuddy.com/)
to build and test my regular expressions. Have you used it before? It is not
free, but is only about 30 dollars to buy, and IMHO, well worth the
investment. I have also used it to train myself with regular expressions, as
it has some great tools for building and testing. In addition, it has a very
nice reference manual, and is compatible with almost every "flavor" of
regular expressions, including .Net. You can also create your own library
with it.

I use the Regex Explorer app that I built for the book ....

 
K

Kevin Spencer

I use the Regex Explorer app that I built for the book ....

Well Jon, now I'll have to buy the book to try it out!

 
K

Kevin Spencer

Thanks Jon. I'll check out the Regex Explorer first, and add the book to my
list of books I need to buy and read!

 