Extract HTML + Reg Ex

Ori · Feb 11, 2004

Hi,

I have some HTML text that I need to parse in order to extract data from
it.

My HTML contains a table with a few rows and two columns. I want to
extract the data from the 2nd column in the most efficient way (using
regular expressions) rather than using the "indexOf" function of String.

Thanks,

Ori.

Here is the HTML table:

<table BORDER="1" CELLSPACING="0" CELLPADDING="1">
<tr>
<td>Licensee Name</td>
<td BGCOLOR="#ffffcc">JOHN Doo</td>
</tr>
<tr>
<td><a HREF=>Primary Status</a></td>
<td BGCOLOR="#ffffcc">Data_To_Be_Extracted</td>
</tr>
<tr>
<td>License Number</td>
<td BGCOLOR="#ffffcc">Data_To_Be_Extracted</td>
</tr>
<tr>
<td><a >License Type</a></td>
<td BGCOLOR="#ffffcc">Data_To_Be_Extracted</td>
</tr>
<tr>
<td>Header</td>
<td BGCOLOR="#ffffcc">Data_To_Be_Extracted</td>
</tr>
<tr>
<td>Address</td>
<td BGCOLOR="#ffffcc">Data_To_Be_Extracted</td>
</tr>
<tr>
<td>City State State Zip </td>
<td BGCOLOR="#ffffcc">Data_To_Be_Extracted</td>
</tr>
</table>

Matthias Kwiedor · Feb 11, 2004

Hi!

Try this:

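// Needs: using System.Text.RegularExpressions;
// First split the HTML into table rows
string[] arrLines = Regex.Split(strContent, @"<tr\b[^>]*>",
    RegexOptions.IgnoreCase);

// Go through each row
foreach (string strLine in arrLines)
{
    // Split the row into cells; strCol[0] is the text before the first <td>
    string[] strCol = Regex.Split(strLine, @"<td\b[^>]*>",
        RegexOptions.IgnoreCase);
    // Skip chunks without two cells (e.g. the text before the first <tr>)
    if (strCol.Length < 3)
        continue;
    // Remove the remaining HTML tags and surrounding whitespace
    string strValue = Regex.Replace(strCol[2], @"<[^>]*>", "").Trim();
    // Second column
    MessageBox.Show(strValue);
}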
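
As an alternative to splitting twice, a single pattern could capture the second cell of each row directly. This is only a sketch: the names SecondColumnExtractor and Extract are made up for illustration, and it assumes every row has at least two plain <td> cells and that tables are not nested.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SecondColumnExtractor
{
    // Hypothetical helper (not from the original post): returns the text of
    // the second <td> of every <tr> found in the given HTML.
    public static List<string> Extract(string html)
    {
        // Row start, first cell (skipped), second cell captured as "value".
        // Singleline lets '.' cross line breaks; the lazy quantifiers stop
        // at the nearest closing </td>. Assumes no nested tables.
        string pattern =
            @"<tr\b[^>]*>\s*<td\b[^>]*>.*?</td>\s*<td\b[^>]*>(?<value>.*?)</td>";
        RegexOptions options =
            RegexOptions.IgnoreCase | RegexOptions.Singleline;

        var values = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, options))
        {
            // Strip any tags left inside the cell and trim whitespace.
            string text = Regex.Replace(m.Groups["value"].Value, @"<[^>]*>", "");
            values.Add(text.Trim());
        }
        return values;
    }

    static void Main()
    {
        string html =
            "<table><tr>\n<td>Licensee Name</td>\n" +
            "<td BGCOLOR=\"#ffffcc\">JOHN Doo</td>\n</tr></table>";

        foreach (string value in Extract(html))
            Console.WriteLine(value);
    }
}
```

Regular expressions like this tend to be fragile on real-world HTML (missing closing tags, nested tables), so if the page layout varies, a proper HTML parser is probably the safer choice.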

Hope that's what you want!

Greetings

Matthias
