regex pattern - ignore whitespace (CRLF and spaces)?

Craig Buchanan · Mar 27, 2006

I have a HTML fragment that looks like this:

<tr>
<td valign="top" nowrap>Property
ID: </td>
<td valign="top" nowrap colspan="4"
bgcolor="#F0F0F0"> 01-068-24-64-1024</td>
</tr>

I am trying to extract the '' part of it.

This pattern works:

Property \s\ *ID: </td>\s\ *<td .*> (.*)

I would like to simplify the pattern so that it will ignore newline
characters, &nbsp; entities, and runs of more than one contiguous space.
Ideally, the HTML would look like:

<tr><td valign="top" nowrap>Property
ID:</td><td valign="top" nowrap colspan="4"
bgcolor="#F0F0F0">01-068-24-64-1024</td></tr>

Does anyone have a suggestion on this?

Thanks a lot,

Craig Buchanan

Larry Lard · Mar 27, 2006

Craig said:
I have a HTML fragment that looks like this: [snip]
Does anyone have a suggestion on this?

My suggestion is to use HTMLAgilityPack, not Regex, for parsing HTML.

Cerebrus · Mar 27, 2006

Hi Craig,

Part of your question remains unclear to me, so I can only work with
assumptions:

1. Which HTML sample do you want to match? The first, the second, or
both? I will assume both.
2. What part do you want to extract? I think you left that part out.

From your Regex, you apparently want to match the part within the
<td>...</td> tags. That is what I will assume.

Here is a Regex to match (turn on "Dot matches newline" mode for it
to work):

Property\s+ID:\s*</td>\s*<td.*?>(?:[ ]*)(.*)</td>
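
For anyone who wants to try the idea outside .NET, here is a minimal Python sketch of a variation on the pattern above. It is not the exact pattern from this post: the GAP helper, the `<td[^>]*>` tag match and the lazy `(.*?)` capture are my own assumptions, and `re.DOTALL` plays the role of "Dot matches newline" mode. The name extract_property_id is hypothetical.

```python
import re

# The HTML fragment from the question, with its line breaks preserved.
HTML = (
    '<tr>\n'
    '<td valign="top" nowrap>Property\n'
    'ID: </td>\n'
    '<td valign="top" nowrap colspan="4"\n'
    'bgcolor="#F0F0F0"> 01-068-24-64-1024</td>\n'
    '</tr>\n'
)

# Assumption (not in the original pattern): treat ordinary whitespace and
# the literal "&nbsp;" entity as interchangeable filler between tokens.
GAP = r'(?:\s|&nbsp;)*'

PATTERN = re.compile(
    r'Property' + GAP + r'ID:' + GAP + r'</td>' + GAP
    + r'<td[^>]*>' + GAP + r'(.*?)' + GAP + r'</td>',
    re.DOTALL,  # "dot matches newline" mode
)


def extract_property_id(html):
    """Return the captured cell text, or None if the fragment does not match."""
    match = PATTERN.search(html)
    return match.group(1) if match else None


print(extract_property_id(HTML))  # 01-068-24-64-1024
```

The lazy capture stops at the first closing </td>, so it should not run on into later cells if the fragment is part of a larger table.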

Points to note:
----------------------
1. Instead of \s\ *, I have used \s+ where there will be at least one
whitespace character, and \s* where there might be zero or more. This
can be changed to \s* in all cases. \s matches spaces, tabs and line
breaks.
2. If your lines break in an unanticipated position, the Regex will not
match.
3. In order to match zero or more &nbsp; special entities, I have used
(?:[ ]*). This is a non-capturing group, so it will not store the
entity in a backreference.
4. If you're using .NET, you can turn on "Dot matches newline" mode
using the RegexOptions.Singleline option.
5. Regexes can only match very specific strings. Usually you can relax
them a bit for spaces and line breaks, but not for other characters. For
instance, if an &nbsp; is inserted anywhere else in the string, except
for within the <td>...</td> tags, the Regex will not match. So, if you're
expecting very diverse HTML fragments, you would be better off with
Larry's suggestion of using HTMLAgilityPack. It can be downloaded from:

http://www.codefluent.com/smourier/download/htmlagilitypack.zip
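
To illustrate why a parser copes better with stray whitespace and entities, here is a rough Python sketch of the parser-based approach. HTMLAgilityPack is a .NET library, so this uses Python's standard-library html.parser as a stand-in rather than HTMLAgilityPack itself; the names CellCollector and value_after_label are hypothetical, and the "Property ID:" label is taken from the fragment in the question.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, with whitespace collapsed."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; (to U+00A0).
        super().__init__(convert_charrefs=True)
        self.cells = []
        self._buffer = None  # None while we are outside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            # str.split() with no arguments splits on any run of whitespace,
            # including newlines and non-breaking spaces.
            self.cells.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def value_after_label(html, label="Property ID:"):
    """Return the text of the cell that follows the cell equal to `label`."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    for current, following in zip(parser.cells, parser.cells[1:]):
        if current == label:
            return following
    return None


SAMPLE = (
    '<tr>\n<td valign="top" nowrap>Property&nbsp;\nID:&nbsp;</td>\n'
    '<td valign="top" nowrap colspan="4"\nbgcolor="#F0F0F0">'
    '&nbsp;01-068-24-64-1024</td>\n</tr>'
)
print(value_after_label(SAMPLE))  # 01-068-24-64-1024
```

Because the cell text is normalized after parsing, line breaks and &nbsp; entities can appear anywhere inside a cell without any change to the lookup.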

HTH,

Regards,

Cerebrus.
