How do I handle linebreaks in Regex?

Guest · Aug 22, 2006

I'm trying to do some regex in C# but for some reason linebreaks are causing
my regex to not work.

the test string goes like this:

string ss = "<tagname something=45678&somethingelse=12345>blah</tagname>\r\n"
          + "<tag2>stuff</tag2>";

and my regex code is like:

Regex pat = new Regex("something=([0-9]*).*>(.*)<.*<tag2>(.*)</tag2>",
                      RegexOptions.Multiline);

foreach (Match m in pat.Matches(ss)) {
    foreach (Group g in m.Groups) {
        Console.Write(g + ", ");
    }
    Console.WriteLine();
}

But it never works unless I remove the "\r\n" from the test string.

How do I get around that? I thought that's what RegexOptions.Multiline
was supposed to take care of.

Guest · Aug 22, 2006

Oops, looks like I should have used "Singleline" option, not Multiline!
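
For reference, a minimal sketch of what the two options actually change in .NET: Singleline lets '.' match a newline, while Multiline only changes where ^ and $ match. The class and helper names (OptionsDemo, DotCrossesNewline, CountLineStarts) and the "one\ntwo" test string are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class OptionsDemo
{
    // Hypothetical helper: can '.' bridge the line break under these options?
    public static bool DotCrossesNewline(RegexOptions options)
    {
        return Regex.IsMatch("one\ntwo", "one.two", options);
    }

    // Hypothetical helper: how many times does ^\w+ match, i.e. how many
    // positions count as a "start" for the ^ anchor?
    public static int CountLineStarts(RegexOptions options)
    {
        return Regex.Matches("one\ntwo", @"^\w+", options).Count;
    }

    static void Main()
    {
        Console.WriteLine(DotCrossesNewline(RegexOptions.None));       // False
        Console.WriteLine(DotCrossesNewline(RegexOptions.Multiline));  // False
        Console.WriteLine(DotCrossesNewline(RegexOptions.Singleline)); // True
        Console.WriteLine(CountLineStarts(RegexOptions.None));         // 1
        Console.WriteLine(CountLineStarts(RegexOptions.Multiline));    // 2
    }
}
```

The two options are independent flags, so they can also be combined with | when both behaviours are wanted.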

Guest · Aug 22, 2006

Ok there's definitely something I am not understanding with Regular
Expressions...

when I switched to Singleline mode it worked great for that small test
string but in the main file all hell broke loose and my Console seems to be
printing out the entire file over and over in some infinite loop.

As another test I tried this:

Regex pat = new Regex("<a.*>(.*)</a>", RegexOptions.Singleline);

on a regular web page, and again same problem... it just starts spitting out
the whole file over and over. If I switch to Multiline mode it works great
but it does not pick up any <a> tags which are broken up by one or more
newlines... how do I get around this?

Alan Pretre · Aug 22, 2006

MrNobody said:
Ok there's definitely something I am not understanding with Regular
Expressions...

when I switched to Singleline mode it worked great for that small test
string but in the main file all hell broke loose and my Console seems to be
printing out the entire file over and over in some infinite loop.

As another test I tried this:

Regex pat = new Regex("<a.*>(.*)</a>", RegexOptions.Singleline);

on a regular web page, and again same problem... it just starts spitting out
the whole file over and over. If I switch to Multiline mode it works great
but it does not pick up any <a> tags which are broken up by one or more
newlines... how do I get around this?

Instead of . use [\S\s]
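
A minimal sketch of this suggestion, assuming a made-up two-link input. The class [\s\S] matches any character, line breaks included, with no RegexOptions set. The lazy quantifier *? in the second pattern is an addition of this sketch and not part of the suggestion above; without it the greedy * still runs to the last </a> in the input. AnyCharDemo and CountMatches are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnyCharDemo
{
    // Made-up input: the first <a> tag is split across two lines.
    public const string Html = "<a\nhref=\"x\">one</a>\n<a>two</a>";

    // Hypothetical helper: number of matches with no RegexOptions set.
    public static int CountMatches(string pattern)
    {
        return Regex.Matches(Html, pattern).Count;
    }

    static void Main()
    {
        // Greedy: one match that swallows both links.
        Console.WriteLine(CountMatches(@"<a[\s\S]*>([\s\S]*)</a>"));   // 1
        // Lazy: each link is matched separately.
        Console.WriteLine(CountMatches(@"<a[\s\S]*?>([\s\S]*?)</a>")); // 2
    }
}
```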

-- Alan

Kevin Spencer · Aug 22, 2006

Here's the problem:

<a.*>(.*)</a>

You're using the '.' wildcard. By default this means "any character that is
not a newline (\n)". When the Singleline option is on, the '.' matches
newlines too, and therefore matches everything. (Multiline does not affect
'.' at all; it only changes where ^ and $ match.) So, when you have
Singleline off, the newline breaks the match. When you have it turned on,
the greedy .* matches everything from the start of the first match until the
end of the last match.
Here:

<a
something=45678&somethingelse=12345>blah</a>
<a>stuff</a>

With Singleline ON, the first tag matches, and so does every character after
it, until the last "</a>" in the string.

Using the '.' wildcard is to Regular Expressions what using an
Atomic Bomb is to warfare. You want to be as specific as possible, rather
than the opposite.

In the example below, I use a very specific character class: [^<] - This
means any character that is NOT a '<' (or a '>' in another case). This way,
the match stops where the first '<' character is found, and the rest of the
match is evaluated from the remaining portion of the string. The following
will find the 2 (and only 2) matches in your example. Each matching value
will be in Group 1 of the match.

<a[^>]*>([^<]*)</a>
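
A minimal sketch of running that pattern against the sample above, with no RegexOptions set. A negated class such as [^>] matches line breaks, so the tag that is split across two lines is still found. NegatedClassDemo and LinkTexts are hypothetical names used only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class NegatedClassDemo
{
    // The sample from this post, with the first tag broken across lines.
    public const string Sample =
        "<a\nsomething=45678&somethingelse=12345>blah</a>\r\n<a>stuff</a>";

    // Hypothetical helper: returns Group 1 of every match.
    public static string[] LinkTexts(string html)
    {
        MatchCollection matches = Regex.Matches(html, "<a[^>]*>([^<]*)</a>");
        string[] texts = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            texts[i] = matches[i].Groups[1].Value;
        }
        return texts;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", LinkTexts(Sample))); // blah, stuff
    }
}
```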

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Surgery

It takes a tough man to make a tender chicken salad.

Guest · Aug 22, 2006

Kevin, thanks for that tip, it works great for that example!

So there is no way in regex to say something like, accept all characters
until you hit a specific group of characters, like "</div>" ? Like let's say
you are scanning a web page for a specific opening <div> tag, and you want to
grab all the text between that and the next closing </div> tag, so the
contents may include many <'s and >'s inside. I guess regex is not the way to
go for doing something like that?

Kevin Spencer · Aug 23, 2006

Hi Mr. Nobody,

Actually, Regex is quite capable of handling this sort of situation. In the
solution I gave you I went for the simplest solution necessary, as I
understood it. Your second example was using an anchor (<a>) tag, which I
assumed would not contain other tags. In a case where other tags might be
nested, you would need to use a different set of Regex tools.

For example, to get all text between 2 matching beginning and ending tags,
when there are no nested tags, you would use something like:

<([^>]*)>([^<]*)</\1>

This indicates that a match begins with the left angle bracket. The left
angle bracket is followed by a sequence of any length of characters that are
NOT a right angle bracket. This sequence of characters is put into Group 1,
and is followed by the right angle bracket itself. Next comes a sequence of
any length that is NOT a left angle bracket (Group 2, the content), followed
by a left angle bracket and a forward slash. The last part of the match is
that the text from the first tag (Group 1) is matched, followed by a right
angle bracket. Note that Group 1 holds everything inside the opening tag, so
as written this only matches tags that have no attributes.
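
A minimal sketch of the backreference in action. The first constant is the pattern from this post. The second is an assumed variant, not from the post, that captures only the tag name so that tags with attributes also match. BackreferenceDemo, Content, and the test strings are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class BackreferenceDemo
{
    // Pattern from the post: Group 1 is everything inside the opening tag.
    public const string Plain = @"<([^>]*)>([^<]*)</\1>";

    // Assumed variant: Group 1 is only the tag name; attributes are skipped.
    public const string WithAttributes = @"<(\w+)\b[^>]*>([^<]*)</\1>";

    // Hypothetical helper: Group 2 of the first match, or null if none.
    public static string Content(string pattern, string html)
    {
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[2].Value : null;
    }

    static void Main()
    {
        Console.WriteLine(Content(Plain, "<b>bold</b>"));            // bold
        // With attributes, \1 would have to repeat them in the closing tag.
        Console.WriteLine(Content(Plain, "<p class=\"x\">text</p>")
                          ?? "(no match)");                          // (no match)
        Console.WriteLine(Content(WithAttributes,
                                  "<p class=\"x\">text</p>"));       // text
    }
}
```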

For tags that contain nested tags, something like the following might work:

<(table|form|div)[^>]*>(.*?)</\1>

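This indicates that tables, forms, and divs (I'm sure I may have missed one
or two) are matched. The ending tag uses the group captured from the first
tag. Group 2 contains the content. The .*? is lazy, so it takes as little as
possible and stops at the first closing tag with the same name; use it with
RegexOptions.Singleline so that the content can span several lines. Be aware
that a div nested inside another div will end the match at the inner </div>.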
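
A minimal sketch of that pattern on a made-up div whose content spans several lines and contains other tags. It assumes RegexOptions.Singleline so that '.' can cross line breaks. NestedContentDemo, ExtractContent, and the sample HTML are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class NestedContentDemo
{
    // Made-up input: a div holding other tags, spread over several lines.
    public const string Html =
        "<div id=\"post\">\r\n<b>hello</b>\r\n<i>world</i>\r\n</div><p>after</p>";

    // Hypothetical helper: trimmed Group 2 of the first match, or null.
    public static string ExtractContent(string html, RegexOptions options)
    {
        Match m = Regex.Match(html, @"<(table|form|div)[^>]*>(.*?)</\1>", options);
        return m.Success ? m.Groups[2].Value.Trim() : null;
    }

    static void Main()
    {
        // Without Singleline, '.' stops at the first \n and nothing matches.
        Console.WriteLine(ExtractContent(Html, RegexOptions.None) == null); // True

        // With Singleline, everything up to the first </div> is captured.
        Console.WriteLine(ExtractContent(Html, RegexOptions.Singleline));
    }
}
```

The lazy .*? is what answers the "accept everything until </div>" question: the match ends at the first closing tag it can reach, not the last one in the page.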

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Surgery

It takes a tough man to make a tender chicken salad.

Guest · Aug 23, 2006

Awesome!

Thanks Kevin, you were a BIG help. Thanks to you I was able to do what I
needed without resorting to trying to parse the HTML response, which would
have been a nightmare and really slow compared to regex!

I needed to write a program which basically monitors forum responses by
stripping out key strings like the post itself, its thread ID and the user's
name and then performing some checks on them.

Thanks again for your help
