Problem with a Regex

taylorjonl

I am having a problem matching some text. It is a very simple pattern
but it doesn't seem to work. Here goes.

<td[^>]*>.*?</td>

That is the pattern, it should match any <td></td> pair. Here is my
input data.

<td valign="top">Buyer<a href="http://www.google.com">google</a><img
src="www.google.com/s.gif" width="4" border="0">(<a
href="www.google.com">9</a> )<span> </span></td>
<td valign="top">
Buyer



<a href="http://www.google.com">google</a><img
src="www.google.com/s.gif" width="4" border="0">
(
<a href="www.google.com">9</a> )<span> </span></td>

The first and second are exactly the same but the first has the spaces
removed. The pattern will match the first but not match the second. I
am very confused.

I have run some tests. This pattern will match the first but not the
second.

<td[^>]*>.*?Buyer

This will match both of them.

<td[^>]*>\s*?Buyer

This indicates to me that the '.' is not matching a space character.
Any ideas?
 
Jon Skeet [C# MVP]

taylorjonl said:
I am having a problem matching some text. It is a very simple pattern
but it doesn't seem to work. Here goes.

<td[^>]*>.*?</td>

That is the pattern, it should match any <td></td> pair.

Just out of interest, what are you expecting the '?' to do? Usually it
comes after a different character that you want to match 0 or 1 times -
but in this case you don't have a previous character (the .* is the
previous bit).

I'm far from an expert on regexes, but I don't understand what that '?'
will actually match.
It may be part of the problem.

Jon
 
taylorjonl

The ? after the * tells the regex to be non-greedy. Normally it is
greedy so if we had the input of

<td>bucket1</td><td>bucket2</td><td>bucket3</td>

<td[^>]*>.*</td>

would match

<td>bucket1</td><td>bucket2</td><td>bucket3</td>

Because well, it is greedy and will make the largest match. Adding the
? tells it to be non-greedy.
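
A minimal sketch of the difference in C#, using the bucket input above. The class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyLazyDemo
{
    public const string Input = "<td>bucket1</td><td>bucket2</td><td>bucket3</td>";

    // Returns the text of the first match of the given pattern against Input.
    public static string FirstMatch(string pattern)
    {
        return Regex.Match(Input, pattern).Value;
    }

    public static void Main()
    {
        // Greedy: .* runs to the end and backtracks to the LAST </td>.
        Console.WriteLine(FirstMatch(@"<td[^>]*>.*</td>"));
        // prints <td>bucket1</td><td>bucket2</td><td>bucket3</td>

        // Lazy: .*? stops at the FIRST </td> it can reach.
        Console.WriteLine(FirstMatch(@"<td[^>]*>.*?</td>"));
        // prints <td>bucket1</td>

        // With the lazy version, each cell becomes its own match.
        Console.WriteLine(Regex.Matches(Input, @"<td[^>]*>.*?</td>").Count);
        // prints 3
    }
}
```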
 
Simon Dahlbacka

Isn't this just due to the dot NOT matching newlines by default (while
\n is included in \s)?

[MSDN about dot:]
"Matches any character except \n. If modified by the Singleline option,
a period character matches any character. For more information, see
Regular Expression Options."
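
A quick way to check this is a small console program along these lines. It is only a sketch: the input is a shortened stand-in for the second sample from the original post, and the class and member names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNewlineDemo
{
    // The pattern from the original post.
    public const string Pattern = @"<td[^>]*>.*?</td>";

    // Shortened stand-in for the second sample: the cell content spans several lines.
    public const string Input =
        "<td valign=\"top\">\nBuyer\n<a href=\"www.google.com\">9</a> )<span> </span></td>";

    public static bool Matches(RegexOptions options)
    {
        return Regex.IsMatch(Input, Pattern, options);
    }

    public static void Main()
    {
        // Default: '.' stops at '\n', so .*? can never reach </td>.
        Console.WriteLine(Matches(RegexOptions.None));       // False

        // Singleline: '.' matches '\n' as well.
        Console.WriteLine(Matches(RegexOptions.Singleline)); // True

        // \s includes '\n', which is why the \s*?Buyer test matched both samples.
        Console.WriteLine(Regex.IsMatch(Input, @"<td[^>]*>\s*?Buyer")); // True
    }
}
```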
 
Cerebrus

Hi taylorjonl and Jon Skeet,

I have a few points to make here :

1. Analyzing the sample string you gave and the 1st Regex pattern
(<td[^>]*>.*?</td>), I realized that it matches perfectly. It is what
you need. The only thing that you need to do now, is enable the Regex
option to allow the dot "." to match a Newline character. This equates
to "Dot matches Newline" in other Regex flavours and
RegexOptions.Singleline in .NET.

I don't know which Regex validator you're using to run your tests, but
just try it with that option enabled, and it will definitely work.

2. The ".*?" - This has a special meaning. ".*" alone means "Match any
character any no. of times, as many times as possible (Greedily)" and
".*?" means "Match any character any no. of times, but as few times as
possible (Lazily)".
The difference between a Greedy and a Lazy match is that the former
will match as many occurrences as possible, while the second will match
as few as possible. The latter will give you the shortest match.
I usually use (.*?) to match anything between any other text.
If you simply used the Regex pattern "<td(.*?)</td>" it would still
solve taylorjonl's problem. It just means match anything that comes
between "<td" and the next "</td>" (including spaces and, with the
Singleline option set, newlines and what not !)

3. I think the important point in deciding any Regex Pattern is what
you want to retrieve from it. (what will be stored in the back
reference). For instance, in your sample string, what exactly do you
intend to retrieve ? Whatever it is should be in brackets.

Assuming it's the "Buyer" part, use this Regex pattern (Remember to set
the RegexOptions.Singleline option)

<td[^>]*>(.*?)<a.*?</td>

Try a replace action with the replacement pattern "$1" (.NET notation),
and you will have found some Buyers !!! ;-)
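
Roughly, in C#, that could look like the sketch below. It assumes the option meant is RegexOptions.Singleline, uses a shortened version of the sample string, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BuyerDemo
{
    public const string Pattern = @"<td[^>]*>(.*?)<a.*?</td>";

    // Shortened multi-line sample cell.
    public const string Input =
        "<td valign=\"top\">\nBuyer\n<a href=\"http://www.google.com\">google</a> )<span> </span></td>";

    // Reads the first capture group directly from the match.
    public static string FromGroup()
    {
        Match m = Regex.Match(Input, Pattern, RegexOptions.Singleline);
        return m.Groups[1].Value.Trim();
    }

    // Replaces the whole matched cell with the contents of group 1.
    public static string FromReplace()
    {
        return Regex.Replace(Input, Pattern, "$1", RegexOptions.Singleline).Trim();
    }

    public static void Main()
    {
        // Group 1 is "\nBuyer\n" before trimming.
        Console.WriteLine(FromGroup());   // Buyer
        Console.WriteLine(FromReplace()); // Buyer
    }
}
```

Reading m.Groups[1] is usually the simpler route when you only want the captured text; the replace route is handy when the rest of the surrounding document should be kept.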

Hope this helps,

Regards,

Cerebrus.
 
taylorjonl

You are the bomb. This has been driving me nuts trying to figure it out,
and I knew it had to be something simple. If only .NET would behave
like the rest of the world when it comes to regular expressions.

Thanks again, works like a charm now.
 
Cerebrus

Well, you know... .NET is... kinda Exceptional !!! ;-)

BTW, what part of that sample string did you want to retrieve ?

Regards,

Cerebrus.
 
taylorjonl

That string is just a test string I used. I am actually going to be
extracting certain pieces of information from an eBay feedback page.
Since the last post I have come up with the following to do this so far.

using System.Text.RegularExpressions;

Regex regex = new Regex(
    @"<tr[^>]*>[^<]*<td></td>[^<]*<td[^>]*>.*?alt=""(?<type>[^""]+)""></td>"
    + @"[^<]*<td></td>[^<]*<td[^>]*>(?<message>[^<]*)<br></td>"
    + @"[^<]*<td></td>[^<]*<td[^>]*>.*?</td>"
    + @"[^<]*<td></td>[^<]*<td[^>]*>(?<date>.*?)</td>"
    + @"[^<]*<td></td>[^<]*<td[^>]*>[^>]*>(?<item>\d{10})</a></td>"
    + @"[^<]*<td></td>[^<]*</tr>",
    RegexOptions.IgnoreCase
    | RegexOptions.Multiline
    | RegexOptions.Singleline
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
);

That will get me all the important sections that I can reference by
name.
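
For anyone following along, reading named groups back out looks roughly like this. The row and the pattern here are a much simplified, made-up stand-in, not the real eBay markup or the full pattern above, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Hypothetical two-cell row, split over two lines.
    public const string Row = "<tr><td>Great seller</td>\n<td>Jan-05-06</td></tr>";

    // IgnorePatternWhitespace lets the pattern be spaced out for readability;
    // the literal spaces below are ignored by the regex engine.
    static readonly Regex RowRegex = new Regex(
        @"<tr[^>]*> \s* <td[^>]*>(?<message>[^<]*)</td> \s* <td[^>]*>(?<date>.*?)</td> \s* </tr>",
        RegexOptions.IgnoreCase
        | RegexOptions.Singleline
        | RegexOptions.IgnorePatternWhitespace);

    // Returns the value of the named group, or null when the row does not match.
    public static string Extract(string groupName)
    {
        Match m = RowRegex.Match(Row);
        return m.Success ? m.Groups[groupName].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Extract("message")); // Great seller
        Console.WriteLine(Extract("date"));    // Jan-05-06
    }
}
```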

I am using a program called Expresso which is wonderful for testing
these out.

Thanks for the help.
 
Jon Skeet [C# MVP]

taylorjonl said:
The ? after the * tells the regex to be non-greedy. Normally it is
greedy so if we had the input of

<td>bucket1</td><td>bucket2</td><td>bucket3</td>

<td[^>]*>.*</td>

would match

<td>bucket1</td><td>bucket2</td><td>bucket3</td>

Because well, it is greedy and will make the largest match. Adding the
? tells it to be non-greedy.

Aha - great, thanks for that. There's always more to know about
regexes...

Jon
 
