Using regex in html code

Nightcrawler · May 23, 2007

Hi all.

I have an HTML table with multiple rows (one example row below). I
would like to extract everything within the <td> tags into groups on a
row-by-row basis. The process would be: find the first row, extract
the column data, store the data in a text file, then find the next row
and do the same, and so on until all the rows in the document have
been processed.

Please help.

Thanks in advance.

<tr>
<td>1</td>
<td>GET UP </td>
<td>CIARA FT CHAMILLIONAIRE</td>
<td>04:25</td>
<td>128.66</td>
<td></td>
<td>Step Up [Soundtrack]</td>
<td></td>
<td>R&B/Rap</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>D:\Ciara feat. Chamillionare - Get Up.mp3</td>
<td>Stripe, (-1.6 dB, -0.7 dB)</td>
<td></td>
<td></td>
<td>2006/01/01</td>
<td>256000</td>
<td></td>
<td>2</td>
<td>2007/03/28</td>
<td>2006/12/04</td>
<td>2007/3/28 20:50:16</td>
<td>00:07</td>
<td>B</td>
</tr>

Jesse Houwing · May 23, 2007

You're better off using the HTML Agility Pack.

But it can be done using regex:

<tr((?!<td).)*(?><td[^>]*>(?<td>((?!</td).)*)</td[^>]*>\s*)*((?!</tr).)*</tr[^>"*]*>
ExplicitCapture ON
Singleline ON
CaseInsensitive ON (RegexOptions.IgnoreCase)

This will give you one group which will hold all the TDs found. I've
written it to be quite robust, but it isn't the best possible
implementation. If the HTML tables are in a well-known format, this
should be no problem. If they come from an external source, you might
want to test more rigorously.

I'll try to explain:
<tr((?!<td).)*
Find each TR start tag and consume everything after it until you
reach a <td

(?><td[^>]*>(?<td>((?!</td).)*)</td[^>]*>\s*)*
Match the opening TD tag and capture its content until you reach a
</td. Then match the </td> and any whitespace or newline that might
follow. Repeat until all TDs in this row have been captured.

((?!</tr).)*</tr[^>"*]*>
Match everything that follows the last <td>...</td> pair, up to and
including the closing </tr> tag

Executing Regex.Matches will give you a MatchCollection, one Match per
row. Each Match will have one Group named "td". This Group has a list
of Captures containing every value captured under that group name, one
per cell.

Kind Regards,

Jesse Houwing

Kevin Spencer · May 23, 2007

You will need to split the string in order to do this. It can be done by
using 2 regular expressions, very similar:

(?s)<tr[^>]*>(?<content>.*?)</tr>

Splits the table into a match for each row.

Once you have the array of row strings, you can use:

(?s)<td[^>]*>(?<content>.*?)</td>

Splits the row into a match for each column.
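
A minimal sketch of this two-pass approach, written in Python rather than .NET so that it can be run standalone. The two patterns are the ones above, except that Python's re module spells named groups (?P<name>...). The function name extract_rows, the shortened sample HTML, and the tab-separated output are assumptions for illustration and are not part of the original posts.

```python
import re

# The two patterns from the post; Python writes named groups as (?P<name>...).
# IGNORECASE is an added assumption so that <TR>/<TD> would also match.
ROW_RE = re.compile(r"(?s)<tr[^>]*>(?P<content>.*?)</tr>", re.IGNORECASE)
CELL_RE = re.compile(r"(?s)<td[^>]*>(?P<content>.*?)</td>", re.IGNORECASE)


def extract_rows(html):
    """Return a list of rows, each a list of the raw text inside its <td> tags."""
    rows = []
    for row in ROW_RE.finditer(html):  # pass 1: one match per <tr>...</tr>
        inner = row.group("content")
        # pass 2: one match per <td>...</td> inside this row
        cells = [cell.group("content") for cell in CELL_RE.finditer(inner)]
        rows.append(cells)
    return rows


# Hypothetical sample input, shortened from the row in the question.
sample = """
<tr>
<td>1</td>
<td>GET UP </td>
<td></td>
<td>04:25</td>
</tr>
<tr>
<td>2</td>
<td>B</td>
</tr>
"""

rows = extract_rows(sample)
# One tab-separated line per table row, ready to be written to a text file.
lines = ["\t".join(cells) for cells in rows]
```

Empty cells come back as empty strings, so the column positions stay aligned from row to row.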

The reason it can't be done in one pass is that you need to create a match
for each row, and the match cannot contain "sub-matches," only groups, and
unless you know how many columns there are, you can't create a group for
each column. If you DO know how many columns there are, you can, as in:

(?s)<tr[^>]*>.*?(?<row1><td[^>]*>(?<row1content>.*?)</td>).*?(?<row2><td[^>]*>(?<row2content>.*?)</td>).*?</tr>
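
When the column count is known, the single-pattern idea can be sketched as follows. This is again a Python sketch under assumptions: exactly two columns per row, hypothetical group names col1 and col2 (standing in for row1content and row2content in the pattern above), and made-up sample data.

```python
import re

# One pattern per row, with one named group per column.
# Assumes every row has at least two cells; col1/col2 are hypothetical names.
TWO_COL_RE = re.compile(
    r"<tr[^>]*>.*?"
    r"<td[^>]*>(?P<col1>.*?)</td>.*?"
    r"<td[^>]*>(?P<col2>.*?)</td>.*?"
    r"</tr>",
    re.DOTALL | re.IGNORECASE,
)

# Hypothetical two-column sample rows.
sample = "<tr><td>1</td><td>GET UP </td></tr>\n<tr><td>2</td><td>B</td></tr>"

# Each match is one row; the named groups hold that row's two cells.
pairs = [(m.group("col1"), m.group("col2")) for m in TWO_COL_RE.finditer(sample)]
```

A row with fewer cells than the pattern expects can make the lazy .*? run on into the next row, so this form only suits tables with a fixed layout.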

--
HTH,

Kevin Spencer
Microsoft MVP

Printing Components, Email Components,
FTP Client Classes, Enhanced Data Controls, much more.
DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net

Jesse Houwing · May 23, 2007

Kevin,

You actually can get multiple results for the same named group. The
structure is as follows:

MatchCollection 1 ----> * Groups 1 ----> * Captures

Which - sort of - translates to:

Rows ----> * Cells ----> * Cell Values

The expression which will capture this info correctly would then be
something like this:

<tr((?!<td).)*(?><td[^>]*>(?<td>((?!</td).)*)</td[^>]*>\s*)*((?!</tr).)*</tr[^>"*]*>
ExplicitCapture ON
Singleline ON
CaseInsensitive ON (RegexOptions.IgnoreCase)

I tested it and it works like a charm.

Kind regards,

Jesse Houwing

Kevin Spencer · May 24, 2007

I've got to hand it to you, Jesse. That is possibly the most creative use
I've ever seen of regular expressions and the System.Text.RegularExpressions
namespace and classes. I tested it too. It took me a good while to get my
head around what it was doing, and I will have to mull it over some more
before I fully understand it, but it does work beautifully. I'd love to see
some more of your regex work some time.

--
HTH,

Kevin Spencer
Microsoft MVP


Jesse Houwing · May 24, 2007

Kevin,

Thank you.

Jesse
