my head is spinning with regex

Daniel Billingsley

First, if MSFT is listening, I'll say IMO the MSDN material is sorely lacking
in this area. It's just a whole bunch of information thrown at you, and
you're left on your own to organize it in your head. Typical learning
starts with the basics and progresses through increasingly complex information.
Given how confusing regex inherently is, I think that kind of
documentation would be very valuable.

But anyway, I'm trying to write a regex that will parse a line of code from
a .cs file into the code and comments portions if there is a // somewhere in
the line. I realize this may need to be more complex down the road to
handle special occurrences of // other than in a comment (like in a string
literal), but I'm trying to start with the basics. So I have...

Regex regex = new Regex(@"(?<code>.+)//(?<comments>.+)",
    RegexOptions.Compiled);
regexMatch = regex.Match(original);
if (regexMatch.Success)
{
    code = regexMatch.Result("${code}");
    comments = regexMatch.Result("${comments}");
}

This works fine on
// A basic comment line
or
a = b; // code line with comment afterward

but on the line
/// <summary>

I end up with
code = "/"
comments = " <summary>"

I'm not understanding why the "//" in my regex seems to match the "last"
occurrence of the pattern and "skips" the match on the first two slashes of
the three. I thought by definition the first occurrence would be matched.
Indeed, if my original line is "//// something" I end up with "//" and "
something".

Who can clarify this for me? And who can point me to a *good* resource for
regex edumucation?
 
Brian Davis

Try this instead:

(?<code>.*?)//(?<comments>.*)


First of all, changing the +s to *s will allow the regex to match even if
there are no characters before/after the "//". Also, adding the "?" after
the ".*" in the code group will make it match the first occurrence of "//" as
opposed to the last. A plain ".*" is greedy, so it consumes all that it can
and only gives up characters as it needs to in order to match the rest of the
expression. Using ".*?" makes it lazy, so that it matches only what it
must match in order to continue matching the rest of the expression.
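
For what it's worth, here is a minimal sketch of how that pattern might be dropped into the original code. The names CommentSplitter and TrySplit are made up for illustration, and it reads the groups through Match.Groups instead of Match.Result, which is just one way to do it.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentSplitter
{
    // Lazy "code" group stops at the first "//"; both groups may be empty.
    static readonly Regex LineRegex =
        new Regex(@"(?<code>.*?)//(?<comments>.*)", RegexOptions.Compiled);

    // Returns true and fills code/comments when the line contains "//".
    // When there is no "//", the whole line is returned as code.
    public static bool TrySplit(string line, out string code, out string comments)
    {
        Match m = LineRegex.Match(line);
        code = m.Success ? m.Groups["code"].Value : line;
        comments = m.Success ? m.Groups["comments"].Value : "";
        return m.Success;
    }

    static void Main()
    {
        string[] lines =
        {
            "/// <summary>",                              // code=[] comments=[/ <summary>]
            "a = b; // code line with comment afterward", // code=[a = b; ] comments=[ code line with comment afterward]
            "//// something"                              // code=[] comments=[// something]
        };
        foreach (string line in lines)
        {
            string code, comments;
            if (TrySplit(line, out code, out comments))
                Console.WriteLine("code=[{0}] comments=[{1}]", code, comments);
        }
    }
}
```

Note that this still treats a "//" inside a string literal as the start of a comment, so it only covers the basic case discussed here.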


Brian Davis
www.knowdotnet.com
 
Ray Hsieh (Ray Djajadinata)

I would suggest "Mastering Regular Expressions" by Jeffrey Friedl. It's an
excellent, excellent book if you want/need to know about regex.

Daniel Billingsley wrote:

| Who can clarify this for me? And who can point me to a *good* resource for
| regex edumucation?


--
Ray Hsieh (Ray Djajadinata) [SCJP, SCWCD]
ray underscore usenet at yahoo dot com
 
Daniel Billingsley

Thanks Brian, the *? did it. My error there seems like a DOH! moment now.

Also, I found a cool tool that lets you experiment with regex in real time:
http://www.weitz.de/regex-coach/

It has a "step" capability that lets you watch the regex work its magic
step by step. Indeed, just like you describe, stepping through my original
.+ expression, the first .+ gobbles up the whole input string (greedy) and
then only gives back what is necessary to match the //, starting at the END
of the string (so it can give up as little as possible). Hence it "gave up"
the // at the end of /// and not the beginning. Using the .*? it starts at
the beginning of the input string and only eats up characters until it
reaches the // match, which is of course what I wanted. Very interesting to
see this behavior played out step by step.
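
For anyone without the tool handy, the same difference can be checked from code. This is only a rough sketch: the class GreedyVsLazy and the helper SlashIndex are names I made up, and it reports where the literal "//" ended up matching instead of showing the individual backtracking steps.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVsLazy
{
    // Returns the index in the input where the literal "//" was matched
    // (the position right after the "code" group), or -1 if there is no match.
    public static int SlashIndex(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success) return -1;
        Group code = m.Groups["code"];
        return code.Index + code.Length;
    }

    static void Main()
    {
        string input = "//// something";

        // Greedy: .+ takes everything, then backs off from the end,
        // so the "//" that matches is the last one (index 2).
        Console.WriteLine(SlashIndex(@"(?<code>.+)//(?<comments>.+)", input));

        // Lazy: .*? starts empty and grows only as needed,
        // so the "//" that matches is the first one (index 0).
        Console.WriteLine(SlashIndex(@"(?<code>.*?)//(?<comments>.*)", input));
    }
}
```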
 
