Parsing a webpage


Enigma Boy

Hi folks,

I am retrieving a webpage from a site using HttpWebRequest. What I want to
do with the retrieved page is list all the hyperlinks in it. If I do a
simple regex search for <a, I also get links that are commented out in the
markup, and I don't want those. I only want links that are actually active.
This is for a reciprocal link check.

Can someone please point me in the right direction?

Thanks.

--
<a href="http://1pakistangifts.com">Send Gifts to Pakisan at #Pakistan Gifts
Store</a> | <a href="http://dotspecialists.com">Leading Software offshoring
and outsourcing service provider</a> | <a
href="http://websitedesignersrus.com">Professional Websites at affordable
prices</a>
 

Jesse Houwing

Hello Enigma,

Have a look at the HTML Agility Pack. It lets you parse HTML as if it were
XML, so commented-out markup is not mixed up with real elements.

http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack
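
To illustrate why a parser avoids the commented-out links that a regex picks up, here is a minimal sketch. It is not the HTML Agility Pack API: it uses Python's standard-library html.parser as a stand-in, and the names LinkCollector, active_links, and the example.com URLs are made up for the example. The idea carries over to any real HTML parser: comments are handed to the program separately from tags, so an <a> inside <!-- ... --> is never reported as an element.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags that the parser sees as real elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Comment text is routed to handle_comment instead, so anchors
        # inside <!-- ... --> never reach this method.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def active_links(html):
    """Return the href of every non-commented <a> tag, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: one live link, one commented out, one anchor without href.
sample = """
<p><a href="http://example.com/live">live</a></p>
<!-- <a href="http://example.com/hidden">commented out</a> -->
<a name="top">no href</a>
"""

print(active_links(sample))  # prints ['http://example.com/live']
```

For the reciprocal link check, the returned list could then be searched for the URL that is supposed to link back. Links that are only hidden by CSS or added by script are outside what this sketch handles.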
 
