Extracting Links from a Document

John Seeliger · Mar 7, 2005

I am pretty new to VB, so please forgive the simplistic question. This is
using VB .NET Standard 2003.

My form has three objects on it: a TextBox named URL, a Button named Extract
and a WebBrowser named AxWebBrowser1. The goal is to have the user enter a
URL in the TextBox, hit the Extract button, and get the links from the web
page they entered.

So far I have:

AxWebBrowser1.Navigate(URL.Text)

which does display the page in the browser frame (not that I need to see it;
I am doing this to parse the data, so it is not necessary to display the
pages).

and then:

Dim doc As mshtml.HTMLDocument = _
AxWebBrowser1.Document()

which I got off some site and which I assume creates a document object
named doc. Now, how do I extract the links and convert them to strings so
that I can parse them, looking for the keywords I am trying to find?

Thanks
-John

Guest · Mar 7, 2005

I wrote some code to do pretty much exactly that a while back. First, you
don't need to use Internet Explorer if you don't want to. The Microsoft .NET
Framework has classes that make it easy to download a page.

Here's the code. Just create a new Console application, and paste this in:

Option Compare Text
Imports System.Net
Imports System.IO
Imports System.Text.RegularExpressions
Module Module1
Sub Main()
SiteSweep("http://www.asp.net/whidbey/pdc.aspx?tabindex=0&tabid=1", "c:\PDC")

SiteSweep("http://msdn.microsoft.com/events/pdc/agendaandsessions/sessions/default.aspx", "c:\PDC")
Console.WriteLine("Done")
Console.ReadLine()
End Sub
Public Sub SiteSweep(ByVal source As String, ByVal dest As String)
' needed to deal with relative paths
Dim root As String = Left(source, source.IndexOf("/", 7))
Dim current As String = Left(source, source.LastIndexOf("/") + 1)
' pull page
Dim w As New WebClient
Dim sr As New StreamReader(w.OpenRead(source))
Dim s As String = sr.ReadToEnd()
' find hrefs
Dim r As New Regex("href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
RegexOptions.IgnoreCase Or RegexOptions.Compiled)
' get rid of dups
Dim d As New Hashtable
For Each m As Match In r.Matches(s)
Dim url As String = m.Groups(1).Value
' find only certain file types. This could have been done with the
' previous regex, except (1) I ripped that regex off of MSDN, and (2)
' I plan on running the app all of one time, so who cares.
If Right(url, 4) = ".ppt" Or Right(url, 4) = ".zip" Or _
Right(url, 4) = ".doc" Then
If Left(url, 7) <> "http://" Then
If url.StartsWith("/") Then
url = root & url
Else
url = current & url
End If
End If
d(url) = Right(url, Len(url) - url.LastIndexOf("/") - 1)
End If
Next
If Not Directory.Exists(dest) Then
Directory.CreateDirectory(dest)
End If
' download each file. If the download bombs, try again, unless you get
' a 415 or 404, because there appears to be a problem with some of the
' files, or they are hrefs that are commented out, and my regex ain't
' smart enough to figure that out.
For Each s In d.Keys
Dim isDownloaded As Boolean = False
While Not isDownloaded
Try
Console.WriteLine("Downloading:" & s)
If Not File.Exists(dest & "\" & d(s)) Then
w.DownloadFile(s, dest & "\" & d(s))
End If
isDownloaded = True
Catch exc As Exception
Console.WriteLine(exc.Message)
If exc.Message.IndexOf("(415)") >= 0 Or _
exc.Message.IndexOf("(404)") >= 0 Then
isDownloaded = True
End If
End Try
End While
Next
End Sub
End Module
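
If you only want the links as strings (to search them for keywords, as in the original question) rather than downloading files, the same regex idea can be wrapped in a small function. The following is a minimal sketch, not part of the program above: the names LinkExtractor and ExtractLinks, the example.com address and the "pdc" keyword are made up for illustration. It also assumes two changes from the code above: relative links are resolved with System.Uri instead of the root/current string handling, and the unquoted-href branch of the regex stops at ">" as well as whitespace.

```vbnet
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module LinkExtractor
    ' Hypothetical helper: returns each distinct href value found in html,
    ' resolved against baseUrl (the address the page was downloaded from).
    Public Function ExtractLinks(ByVal html As String, ByVal baseUrl As String) As ArrayList
        ' Quoted hrefs, or unquoted hrefs up to the next whitespace or ">".
        Dim r As New Regex("href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>[^\s>]+))", _
            RegexOptions.IgnoreCase)
        Dim baseUri As New Uri(baseUrl)
        Dim links As New ArrayList
        For Each m As Match In r.Matches(html)
            Dim raw As String = m.Groups(1).Value
            Try
                ' Uri(base, relative) handles "/path", "file.zip" and full URLs.
                Dim absolute As String = New Uri(baseUri, raw).AbsoluteUri
                If Not links.Contains(absolute) Then links.Add(absolute)
            Catch ex As UriFormatException
                ' Not a usable link; skip it.
            End Try
        Next
        Return links
    End Function

    Sub Main()
        ' Sample markup stands in for the page text (s in SiteSweep above).
        Dim html As String = "<a href=""/docs/intro.html"">Intro</a> " & _
            "<a href=""pdc.zip"">Slides</a>"
        For Each link As String In ExtractLinks(html, "http://www.example.com/events/default.aspx")
            ' Keep only links containing the keyword, ignoring case.
            If link.ToLower().IndexOf("pdc") >= 0 Then
                Console.WriteLine(link)
            End If
        Next
    End Sub
End Module
```

In a real run you would pass in the string read from the StreamReader and the URL the user typed, then apply whatever keyword test you need inside the loop. Like the original, this is plain pattern matching, so it will also pick up hrefs that sit inside HTML comments.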

Scott Swigart
www.swigartconsulting.com
blog.swigartconsulting.com
