Easy (?!) regular expression -- find line breaks

Peter Duniho

So, I'm trying to learn how the Regex class works, and I've been trying to
use it to do what I think ought to be simple things. Except I can't
figure out how to do everything I want. :(

If I want to take a string and break it into individual lines based on a
specific pattern ("\r\n" in this case, but I don't think it matters), I
can easily write a loop that does this by scanning through the string
accumulating characters and spitting out a new string each time it hits
the "\r\n". But I figured Regex ought to be able to do the scanning for
me, so that all I have to loop through are the matches.

I've tried a wide variety of expression strings, but the ones that seem to
come closest to what I want are:

"(.+)\r\n" -- works great, except that if the string doesn't terminate
in a "\r\n", the last line isn't matched

"(.+)(\r\n)*" -- the idea being to allow the last line to match if no
"\r\n" is found. Works great, except that the "\r" winds up getting
captured as well (presumably because the second capture group is just
ignored and everything up to the "\n" gets captured by the first capture
group because the default is to match greedily)

"(.+?)(\r\n)*" -- works great, except that it's _too_ lazy, and
happily matches just a single character at a time
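
For anyone following along, the three attempts can be compared side by side. The sketch below uses Python's re module as a stand-in for the .NET Regex class (an assumption; greedy and lazy quantifiers, and a dot that does not match "\n", should behave the same way in both). The sample text and the helper name are made up for illustration.

```python
import re

# Hypothetical sample: three lines, the last one without a trailing "\r\n".
text = "one\r\ntwo\r\nthree"

def first_groups(pattern, s):
    """Return capture group 1 of every successive match."""
    return [m.group(1) for m in re.finditer(pattern, s)]

# Attempt 1: the line break is required, so the unterminated last line is missed.
required = first_groups(r"(.+)\r\n", text)

# Attempt 2: "." matches "\r" (only "\n" is excluded), so the greedy ".+"
# swallows the "\r" and the optional group then matches zero times.
greedy = first_groups(r"(.+)(\r\n)*", text)

# Attempt 3: the lazy ".+?" stops after one character, because everything
# that follows it in the pattern is optional.
lazy = first_groups(r"(.+?)(\r\n)*", text)

print(required)
print(greedy)
print(lazy)
```

The lazy version fails because nothing after the first group forces it to keep going; it needs something mandatory to stop at.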

(Note: I'm using a replacement string specifying the first capture group
so that I can toss out the "\r\n", but if there's a way to match the
"\r\n" without it winding up in the match itself while at the same time
preventing it from being included in the subsequent match attempt, that
would be wonderful).
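
One way to get what this note asks for is to avoid matching the line break at all. A negated character class matches only the characters between breaks, so no capture group or replacement string is needed. This is a different technique from the ones tried above, sketched in Python's re module on the assumption that .NET treats the character class the same way. Note that it skips empty lines.

```python
import re

# Match runs of characters that are neither "\r" nor "\n".
line_pattern = re.compile(r"[^\r\n]+")

with_final_break = line_pattern.findall("one\r\ntwo\r\nthree\r\n")
without_final_break = line_pattern.findall("one\r\ntwo\r\nthree")

print(with_final_break)
print(without_final_break)
```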

I also tried using single-line mode, trying to work around the problem in
the second example, but when I do that, the expression happily and
greedily captures _everything_ up to the very last "\r\n".
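
The single-line behavior can be reproduced with Python's re.DOTALL flag, used here as a rough equivalent of single-line mode in .NET (an assumption). Once the dot is allowed to match "\n", the greedy ".+" runs to the end of the text and backs up only as far as the last "\r\n"; making it lazy stops it at the first one instead.

```python
import re

text = "one\r\ntwo\r\nthree\r\n"

# Greedy: one match whose group spans everything before the final "\r\n".
greedy_dotall = re.findall(r"(.+)\r\n", text, flags=re.DOTALL)

# Lazy: each match stops at the nearest "\r\n".
lazy_dotall = re.findall(r"(.+?)\r\n", text, flags=re.DOTALL)

print(greedy_dotall)
print(lazy_dotall)
```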

What I'm looking for is the expression that represents "capture all text
up to the first \r\n pair, allowing for the possibility of one last match
without the \r\n pair at the end of the string".

Is this actually impossible using Regex, or is there some combination of
options that will allow me to match the first \r\n pair without requiring
a \r\n pair at the end of the last match?

Thanks,
Pete
 
Peter Duniho

[...]
If I want to take a string and break it into individual lines based on a
specific pattern ("\r\n" in this case, but I don't think it matters), I
can easily write a loop that does this by scanning through the string
accumulating characters and spitting out a new string each time it hits
the "\r\n". But I figured Regex ought to be able to do the scanning for
me, so that all I have to loop through are the matches.

And just to clarify...

Yes, I understand that I can just use String.Split() to do this. I'm
talking about the more general question of the matching, and my little
self-assigned homework exercise to try to learn how Regex works.
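
For comparison, this is what the plain splitting approach looks like. The sketch uses Python's str.split and re.split as analogs of String.Split() (an assumption about equivalence, not the .NET calls themselves). One thing to watch: a trailing line break produces an empty final element.

```python
import re

text = "one\r\ntwo\r\nthree"

by_str_split = text.split("\r\n")
by_re_split = re.split(r"\r\n", text)

# A terminated last line leaves an empty string at the end of the result.
trailing = "one\r\n".split("\r\n")

print(by_str_split, by_re_split, trailing)
```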
 
Guest

Peter said:
[...]
(Note: I'm using a replacement string specifying the first capture group
so that I can toss out the "\r\n", but if there's a way to match the
"\r\n" without it winding up in the match itself while at the same time
preventing it from being included in the subsequent match attempt, that
would be wonderful).

Use a non-capturing group: (?:\r\n)

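
A small check of what a non-capturing group does and does not do, sketched in Python's re module (assumed to match .NET on this point). The group is not numbered, but the text it matches is still consumed and is still part of the overall match, so the first capture group is still the way to get the line without its break.

```python
import re

pattern = re.compile(r"(.+)(?:\r\n)")
m = pattern.search("one\r\ntwo")

whole_match = m.group(0)      # includes the line break
captured = m.group(1)         # the line text only
group_count = pattern.groups  # (?:...) does not add a numbered group

print(repr(whole_match), repr(captured), group_count)
```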
I also tried using single-line mode, trying to work around the problem
in the second example, but when I do that, the expression happily and
greedily captures _everything_ up to the very last "\r\n".

What I'm looking for is the expression that represents "capture all text
up to the first \r\n pair, allowing for the possibility of one last
match without the \r\n pair at the end of the string".

Match either \r\n or $ (end of text): (.+?)(?:\r\n|$)
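
Here is that expression run against text with and without a final line break, again using Python's re module as a stand-in for .NET (an assumption). The lazy ".+?" now has something mandatory to stop at: either a "\r\n" or the end of the text. The helper name is made up for illustration.

```python
import re

pattern = re.compile(r"(.+?)(?:\r\n|$)")

def lines(s):
    """Collect capture group 1 from each successive match."""
    return [m.group(1) for m in pattern.finditer(s)]

unterminated = lines("one\r\ntwo\r\nthree")
terminated = lines("one\r\ntwo\r\nthree\r\n")

print(unterminated)
print(terminated)
```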
 
Jesse Houwing

* Peter Duniho wrote, On 14-6-2007 19:25:
Ah. So simple. Thanks!


Even easier would be to turn on RegexOptions.Multiline and look for
the following: "^.*$". This should match at the beginning of every line
(^), fetch the content (.*), and end at the end of each line ($).
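
A quick sketch of the multiline approach on text that uses "\n" alone as the line break, with Python's re.MULTILINE standing in for RegexOptions.Multiline (an assumption about equivalence):

```python
import re

unix_text = "one\ntwo\nthree"

# With MULTILINE, "^" and "$" anchor at every line, not just the whole text.
unix_lines = re.findall(r"^.*$", unix_text, flags=re.MULTILINE)

print(unix_lines)
```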

It's probably faster as well.

Jesse
 
Peter Duniho

Even easier would be to turn on RegexOptions.Multiline and look for
the following: "^.*$". This should match at the beginning of every line
(^), fetch the content (.*), and end at the end of each line ($).

Except that as near as I can tell, Regex only uses Unix-style linebreaks.
That is, \n by itself. Which means that if I use the Multiline option
(which seems to be the default, actually), I wind up with the \r as part
of my matched strings, which I don't want.
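
The stray "\r" is easy to reproduce. The sketch below uses Python's re module, where "$" in multiline mode likewise anchors only before "\n" (treating it as a stand-in for .NET is an assumption). It also shows the same pattern with the multiline flag off. In Python the flag is off by default, and as far as I know the .NET default is RegexOptions.None as well.

```python
import re

windows_text = "one\r\ntwo"

# "$" anchors just before "\n", and "." matches "\r", so the "\r" is kept.
with_multiline = re.findall(r"^.*$", windows_text, flags=re.MULTILINE)

# Without the flag, "^" and "$" only anchor at the ends of the whole text,
# and "." cannot cross the "\n", so nothing matches.
without_multiline = re.findall(r"^.*$", windows_text)

print(with_multiline)
print(without_multiline)
```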

Pete
 
Jesse Houwing

* Peter Duniho wrote, On 14-6-2007 20:39:
Except that as near as I can tell, Regex only uses Unix-style
linebreaks. That is, \n by itself. Which means that if I use the
Multiline option (which seems to be the default, actually), I wind up
with the \r as part of my matched strings, which I don't want.

This shouldn't be so, but it does seem to be the case in .NET 2.0. I've
filed a bug against it, and it should be fixed in the Orcas framework. It
wasn't this way in .NET 1.0 and 1.1, as far as I can remember.

^.*?\r?$ should fix it in the meantime, but is probably slower.
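
A sketch of how a workaround along these lines might look, in Python's re module as a stand-in for .NET. Two things here are my assumptions rather than part of the post: that the expression is meant to end with "$", and that a capture group is wrapped around the lazy part so the optional "\r" can be left out of the result.

```python
import re

# Lazy line content, then an optional "\r" that stays outside the group.
pattern = re.compile(r"^(.*?)\r?$", flags=re.MULTILINE)

crlf_lines = [m.group(1) for m in pattern.finditer("one\r\ntwo\r\nthree")]
lf_lines = [m.group(1) for m in pattern.finditer("one\ntwo\nthree")]

print(crlf_lines)
print(lf_lines)
```

The same expression handles both kinds of line break, since the "\r" is optional.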

Please file a bug against this if you want to see it fixed in the next
service pack of .NET 2.0. I tried, but they keep closing the bug with the
message that they cannot reproduce it in Orcas, which is still far away
for quite a few of our customers.

Jesse
 
