Extract data from web page.

Thief_

I've got this type of info on a web page:

----------------------------------------------------------------------------
--------------------------------------------
<tr height="25">
<td nowrap class="odd" align="center"><img
src="/forums/images/icon_topic_new.gif" width=14 height=14 alt='New Topic'
border=0></td>

<td nowrap class="odd" align="center">&nbsp;</td>

<td nowrap class="odd" align="center">&nbsp;</td>
<td width="85%" class="even" align="left"><font class="new-row"><a
href="topic.asp?tid=106110">
Quality ebay auction</a>&nbsp;</font>
<font class="sub-row">in General&nbsp;/&nbsp;The Lounge</font><font
class="sub-row"><br>Started 7/15/2005 - pages <a
href="topic.asp?tid=106110">1</a> - last posted by <a
href="profile.asp?action=view&id=Shandy" onmouseover="window.status='Show
the authors profile'; return true;" onmouseout="window.status=''; return
true;">Shandy</a></font></td>
<td width="15%" class="even" valign="middle" align="left"><font
class="new-row"><a href="profile.asp?action=view&id=DiscoInferno"
onmouseover="window.status='Show the authors profile'; return true;"
onmouseout="window.status=''; return true;">DiscoInf<BR>erno</a></font></td>
<td nowrap class="odd" valign="middle" align="center"><font
class="new-row">9</font></td>
<td nowrap class="odd" valign="middle" align="left">
<font class="new-row">7/15/2005<br>
<font class="sub-row">5:02:16 PM</font></font></td>
</tr>
----------------------------------------------------------------------------
--------------------------------------------

It's a table which shows the latest posts of a forum. I'd like to pull out
the following information:
Topic: Quality ebay auction
Original poster: DiscoInferno
Started: 7/15/2005
Last Post By: Shandy
Last Post Date: 7/15/2005 5:02:16 PM

This *type* of information is repeated down the web page although the data
will change.
.....

and I want to do this with the whole page/table. Should I use RegEx to get
the data or simply do a string search when I download the page's source into
my application?
 
Ken Tucker [MVP]

Hi,

Here is a start. It uses a regex to extract links.


' Download the page source.
Dim wc As New System.Net.WebClient
Dim sr As New System.IO.StreamReader(wc.OpenRead("http://news.google.com/"))
Dim strHtml As String

' Matches any double-quoted value; the named group "url" holds the text between the quotes.
Dim regLink As New System.Text.RegularExpressions.Regex("\""(?<url>[^\""]*)\""")
' Matches the text between a ">" and the next "<".
Dim regTitle As New System.Text.RegularExpressions.Regex(">(.*?)\<")
' Matches simple anchors of the form <a href="...">...</a> (no other attributes, all on one line).
Dim regHref As New System.Text.RegularExpressions.Regex("\<a href=""(.*?)""\>(.*?)\<\/a\>")

Try
    strHtml = sr.ReadToEnd

    For Each m As System.Text.RegularExpressions.Match In regHref.Matches(strHtml)
        ' Print the quoted href value of this anchor.
        For Each mLink As System.Text.RegularExpressions.Match In regLink.Matches(m.Value)
            Trace.WriteLine(String.Format("Link {0}", mLink.Groups("url").Value))
        Next

        ' Print the anchor text; the capture group already excludes ">" and "<".
        For Each mLink As System.Text.RegularExpressions.Match In regTitle.Matches(m.Value)
            Trace.WriteLine(String.Format("Title {0}", mLink.Groups(1).Value))
        Next
    Next
Catch ex As Exception
    ' Report the failure instead of silently swallowing it.
    Trace.WriteLine(ex.Message)
Finally
    sr.Close()
    wc.Dispose()
End Try

A good resource for regular expression examples:

http://www.regexlib.com/DisplayPatterns.aspx?cattabindex=4&categoryId=8

Ken
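
As a rough sketch of how the same regex approach might be pointed at the forum row from the question: the code below runs one pattern with named groups over a trimmed copy of the sample row (the onmouseover attributes are left out for brevity). The module name ForumRowSketch, the helper CleanText and the pattern itself are assumptions made for illustration, not something posted in this thread, and the pattern is tied to this exact markup, so it would need adjusting if the forum's HTML differs.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ForumRowSketch

    ' Hypothetical helper: strips tags such as <BR> and &nbsp; entities, then trims.
    Function CleanText(ByVal s As String) As String
        Dim noTags As String = Regex.Replace(s, "<[^>]+>", "")
        Return noTags.Replace("&nbsp;", " ").Trim()
    End Function

    Sub Main()
        ' Stand-in for the downloaded page source: the sample row, trimmed.
        Dim html As String = String.Join(Environment.NewLine, New String() { _
            "<tr height=""25"">", _
            "<td width=""85%"" class=""even"" align=""left""><font class=""new-row""><a", _
            "href=""topic.asp?tid=106110"">", _
            "Quality ebay auction</a>&nbsp;</font>", _
            "<font class=""sub-row"">in General&nbsp;/&nbsp;The Lounge</font><font", _
            "class=""sub-row""><br>Started 7/15/2005 - pages <a", _
            "href=""topic.asp?tid=106110"">1</a> - last posted by <a", _
            "href=""profile.asp?action=view&id=Shandy"">Shandy</a></font></td>", _
            "<td width=""15%"" class=""even""><font class=""new-row""><a", _
            "href=""profile.asp?action=view&id=DiscoInferno"">DiscoInf<BR>erno</a></font></td>", _
            "<td nowrap class=""odd""><font class=""new-row"">9</font></td>", _
            "<td nowrap class=""odd"">", _
            "<font class=""new-row"">7/15/2005<br>", _
            "<font class=""sub-row"">5:02:16 PM</font></font></td>", _
            "</tr>"})

        ' One named group per field, in the order the fields appear in a row.
        Dim pattern As String = _
            "<a\s+href=""topic\.asp\?tid=\d+"">\s*(?<topic>[^<]+?)\s*</a>" & _
            ".*?Started\s+(?<started>\d+/\d+/\d+)" & _
            ".*?last\s+posted\s+by\s+<a[^>]*>(?<lastby>[^<]+)</a>" & _
            ".*?<a\s+href=""profile\.asp[^>]*>(?<poster>.*?)</a>" & _
            ".*?<font\s+class=""new-row"">\s*(?<date>\d+/\d+/\d+)\s*<br>" & _
            "\s*<font\s+class=""sub-row"">\s*(?<time>[^<]+?)\s*</font>"

        ' Singleline lets "." cross line breaks; IgnoreCase tolerates <BR> vs <br>.
        Dim rx As New Regex(pattern, RegexOptions.Singleline Or RegexOptions.IgnoreCase)

        ' Each match corresponds to one row of the table.
        For Each m As Match In rx.Matches(html)
            Console.WriteLine("Topic: {0}", CleanText(m.Groups("topic").Value))
            Console.WriteLine("Original poster: {0}", CleanText(m.Groups("poster").Value))
            Console.WriteLine("Started: {0}", m.Groups("started").Value)
            Console.WriteLine("Last Post By: {0}", CleanText(m.Groups("lastby").Value))
            Console.WriteLine("Last Post Date: {0} {1}", m.Groups("date").Value, m.Groups("time").Value)
        Next
    End Sub

End Module
```

For the sample row this is intended to print the five fields listed in the question. Because the lazy ".*?" segments can run past the end of a row when a field is missing, a more careful version might first split the page into <tr>...</tr> blocks and apply the pattern to each block separately.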

 
Herfried K. Wagner [MVP]

Parsing an HTML file:

MSHTML Reference
<URL:http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/reference.asp>

- or -

.NET Html Agility Pack: How to use malformed HTML just like it was
well-formed XML...
<URL:http://blogs.msdn.com/smourier/archive/2003/06/04/8265.aspx>

Download:

<URL:http://www.codefluent.com/smourier/download/htmlagilitypack.zip>

- or -

SgmlReader 1.4
<URL:http://www.gotdotnet.com/Community/...mpleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC>

If the page is in XHTML format, you can use the classes in the 'System.Xml'
namespace to read information from it.
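
To illustrate the 'System.Xml' route, here is a minimal sketch that loads a small, well-formed fragment with XmlDocument and picks the cells out with XPath. The fragment, its three-column layout and the module name XhtmlRowSketch are made up for the example. The row in the question is not well-formed XML as posted (unquoted attribute values, a bare nowrap attribute, &nbsp; entities), so LoadXml would reject it unless it were first converted by one of the parsers listed above.

```vbnet
Imports System
Imports System.Xml

Module XhtmlRowSketch

    Sub Main()
        ' Hypothetical well-formed fragment; note &amp; instead of a bare & in the href.
        Dim xhtml As String = _
            "<table>" & _
            "<tr>" & _
            "<td><a href=""topic.asp?tid=106110"">Quality ebay auction</a></td>" & _
            "<td><a href=""profile.asp?action=view&amp;id=DiscoInferno"">DiscoInferno</a></td>" & _
            "<td>7/15/2005 5:02:16 PM</td>" & _
            "</tr>" & _
            "</table>"

        Dim doc As New XmlDocument()
        doc.LoadXml(xhtml)

        ' Visit every row and read its cells by position.
        For Each row As XmlNode In doc.SelectNodes("//tr")
            Dim topic As XmlNode = row.SelectSingleNode("td[1]/a")
            Dim poster As XmlNode = row.SelectSingleNode("td[2]/a")
            Dim lastPost As XmlNode = row.SelectSingleNode("td[3]")

            Console.WriteLine("Topic: {0}", topic.InnerText)
            Console.WriteLine("Original poster: {0}", poster.InnerText)
            Console.WriteLine("Last Post Date: {0}", lastPost.InnerText)
        Next
    End Sub

End Module
```

A real XHTML page normally declares the XHTML namespace on its root element, in which case the XPath queries would need an XmlNamespaceManager and a prefix for each element name; the fragment above has no namespace, so plain element names are enough here.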
 
