regex pattern - ignore whitespace (CRLF and spaces)?

Craig Buchanan

I have an HTML fragment that looks like this:

<tr>
<td valign="top" nowrap><span class="textBold">Property
ID: </span></td>
<td valign="top" nowrap colspan="4"
bgcolor="#F0F0F0"><b>&nbsp;01-068-24-64-1024</b></td>
</tr>

I am trying to extract the '' part of it.

This pattern works:

Property \s\ *ID: </span></td>\s\ *<td .*><b>&nbsp;(.*)</b>

I would like to simplify the pattern so that it ignores newline
characters, &nbsp; entities, and runs of more than one space. Ideally,
the HTML would look like:

<tr><td valign="top" nowrap><span class="textBold">Property
ID:</span></td><td valign="top" nowrap colspan="4"
bgcolor="#F0F0F0"><b>01-068-24-64-1024</b></td></tr>

Does anyone have a suggestion on this?

Thanks a lot,

Craig Buchanan
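
One way to get the "ideal" single-line form is to normalize the fragment before matching. The following is only a minimal sketch in Python, not something from the thread: the names normalize_html and extract_property_id are hypothetical, and it assumes that dropping every &nbsp; and all whitespace around tags is acceptable for the page in question.

```python
import re

SAMPLE = """
<tr>
<td valign="top" nowrap><span class="textBold">Property
ID: </span></td>
<td valign="top" nowrap colspan="4"
bgcolor="#F0F0F0"><b>&nbsp;01-068-24-64-1024</b></td>
</tr>
"""

def normalize_html(fragment):
    """Drop &nbsp; entities and squeeze whitespace (hypothetical helper)."""
    text = fragment.replace("&nbsp;", "")           # remove the entities outright
    text = re.sub(r"\s+", " ", text)                # CR, LF, tabs, space runs -> one space
    text = re.sub(r"\s*(<[^>]+>)\s*", r"\1", text)  # no whitespace on either side of a tag
    return text.strip()

def extract_property_id(fragment):
    """Match a much simpler pattern against the normalized text."""
    flat = normalize_html(fragment)
    m = re.search(r"Property ID:</span></td><td[^>]*><b>([^<]*)</b>", flat)
    return m.group(1) if m else None

print(extract_property_id(SAMPLE))  # -> 01-068-24-64-1024
```

After normalization the pattern needs no \s handling at all, at the cost of a pass over the whole input first.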
 
Larry Lard

Craig said:
I have an HTML fragment that looks like this: [snip]
Does anyone have a suggestion on this?

My suggestion is to use HTMLAgilityPack, not Regex, for parsing HTML.
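
To illustrate the parser-based idea: HTMLAgilityPack is a .NET library, so the sketch below swaps in Python's standard-library html.parser as a stand-in. It is not the HTMLAgilityPack API. CellCollector and value_after_label are hypothetical names, and the sketch assumes the value sits in the <td> immediately after the label cell and that tables are not nested.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Gathers the text of every <td> cell in document order."""

    def __init__(self):
        # convert_charrefs=True turns &nbsp; into U+00A0 inside the text
        super().__init__(convert_charrefs=True)
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data

def value_after_label(fragment, label):
    """Return the text of the cell that follows the cell containing `label`."""
    parser = CellCollector()
    parser.feed(fragment)
    parser.close()
    # split()/join collapses newlines, space runs and U+00A0 to single spaces
    cells = [" ".join(c.split()) for c in parser.cells]
    for i, text in enumerate(cells[:-1]):
        if text == label:
            return cells[i + 1]
    return None
```

Calling value_after_label(html, "Property ID:") on the fragment from the question should give the ID without any whitespace or &nbsp; handling in a pattern, because the parser deals with line breaks inside tags and with the entity itself.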
 
Cerebrus

Hi Craig,

Part of your question remains unclear to me, so I can only work with
assumptions:

1. Which HTML sample do you want to match? The first, the second, or
both? I will assume both.
2. What part do you want to extract? I think you left that part out.
From your Regex, you apparently want to match the part within the
<b>...</b> tags. That is what I will assume.

Here is a Regex to match (turn on "Dot matches newline" mode for it
to work):

Property\s+ID:\s*</span></td>\s*<td.*?><b>(?:&nbsp;)*(.*?)</b></td>

Points to note:
----------------------
1. Instead of \s\ * , I have used \s+ in cases where there will be
at least one space, and \s* where there might be zero or more spaces.
This can be changed to \s* in all cases. \s matches spaces, tabs
and line breaks.
2. If your lines break in an unanticipated position (for example,
between </span> and </td>), the Regex will not match.
3. In order to match zero or more &nbsp; special entities, I have used
(?:&nbsp;)*. The group is non-capturing, so the entity is not stored in
a backreference. Note that a character class such as [&nbsp;] would
match any one of the characters &, n, b, s, p and ; rather than the
entity as a whole. The capture is lazy, (.*?), so that in "Dot matches
newline" mode it stops at the first </b></td> rather than the last.
4. If you're using .NET, you can turn on "Dot matches newline" mode
using the RegexOptions.Singleline option.
5. Regexes can only match very specific strings. Usually you can relax
them a bit for spaces and line breaks, but not for other characters. For
instance, if an &nbsp; is inserted anywhere else in the string, except
for within the <b>...</b> tags, the Regex will not match. So, if you're
expecting very diverse HTML fragments, you would be better off with
Larry's suggestion of using HTMLAgilityPack. It can be downloaded from:

http://www.codefluent.com/smourier/download/htmlagilitypack.zip
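
For anyone who wants to try the whitespace-tolerant pattern outside .NET, here is a rough sketch in Python, where re.DOTALL plays the role of "Dot matches newline" mode. It uses (?:&nbsp;)* to match the entity as a whole and a lazy (.*?) capture. The name extract_property_id is hypothetical, and the sample is the fragment from the question.

```python
import re

# \s+ / \s* absorb spaces, tabs and line breaks; re.DOTALL lets "." cross
# the line break inside the second <td ...> tag.
PROPERTY_ID = re.compile(
    r"Property\s+ID:\s*</span></td>\s*<td.*?><b>(?:&nbsp;)*(.*?)</b></td>",
    re.DOTALL,
)

SAMPLE = """<tr>
<td valign="top" nowrap><span class="textBold">Property
ID: </span></td>
<td valign="top" nowrap colspan="4"
bgcolor="#F0F0F0"><b>&nbsp;01-068-24-64-1024</b></td>
</tr>"""

def extract_property_id(html):
    """Return the text inside <b>...</b>, or None if the pattern does not match."""
    m = PROPERTY_ID.search(html)
    return m.group(1).strip() if m else None

print(extract_property_id(SAMPLE))  # -> 01-068-24-64-1024

# Point 5 in practice: an &nbsp; in an unexpected place defeats the pattern.
print(extract_property_id(SAMPLE.replace("ID:", "ID:&nbsp;")))  # -> None
```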

HTH,

Regards,

Cerebrus.
 
