Extracting Links from a Document


John Seeliger

I am pretty new to VB, so please forgive the simplistic question. This is
using VB .NET Standard 2003.

My form has three objects on it: a TextBox named URL, a Button named Extract,
and a WebBrowser named AxWebBrowser1. The goal is to have the user enter a
URL in the TextBox, hit the Extract button, and then get the links from the
web page they entered.

So far I have:


which does display the page in the browser frame (not that I need to see it.
I am doing this to parse the data, so it is not necessary to display the
page)

and then:

Dim doc As mshtml.HTMLDocument = _

which I got off some site and which I assume creates a document object
named doc. Now, how do I extract the links and convert them to strings so
that I can parse them, looking for the keywords I am trying to find?



I wrote some code to do pretty much exactly that a while back. First, you
don't need to use Internet Explorer if you don't want to. The Microsoft .NET
Framework has classes that make it easy to download a page.

Here's the code. Just create a new Console application, and paste this in:

Option Compare Text
Imports System.Collections
Imports System.Net
Imports System.IO
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        SiteSweep("http://msdn.microsoft.com/events/pdc/agendaandsessions/sessions/default.aspx", "c:\PDC")
    End Sub

    Public Sub SiteSweep(ByVal source As String, ByVal dest As String)
        ' needed to deal with relative paths
        Dim root As String = Left(source, source.IndexOf("/", 7))
        Dim current As String = Left(source, source.LastIndexOf("/") + 1)
        ' pull page
        Dim w As New WebClient
        Dim sr As New StreamReader(w.OpenRead(source))
        Dim s As String = sr.ReadToEnd()
        sr.Close()
        ' find hrefs (quoted value, or an unquoted run of non-whitespace)
        Dim r As New Regex("href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
            RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        ' get rid of dups: key is the full URL, value is the local file name
        Dim d As New Hashtable
        For Each m As Match In r.Matches(s)
            Dim url As String = m.Groups(1).Value
            ' find only certain file types. This could have been done with
            ' previous regex, except (1) I ripped that regex off of MSDN, and (2)
            ' I plan on running the app all of one time, so who cares.
            If Right(url, 4) = ".ppt" Or Right(url, 4) = ".zip" Or _
               Right(url, 4) = ".doc" Then
                If Left(url, 7) <> "http://" Then
                    If url.StartsWith("/") Then
                        ' site-relative link
                        url = root & url
                    Else
                        ' page-relative link
                        url = current & url
                    End If
                End If
                d(url) = Right(url, Len(url) - url.LastIndexOf("/") - 1)
            End If
        Next
        If Not Directory.Exists(dest) Then
            Directory.CreateDirectory(dest)
        End If
        ' download each file. If the download bombs, try again, unless you
        ' get a 415 or 404, because there appears to be a problem with some
        ' of the files, or they are hrefs that are commented out, and my
        ' regex ain't smart enough to figure that out.
        For Each s In d.Keys
            Dim isDownloaded As Boolean = False
            While Not isDownloaded
                Try
                    Console.WriteLine("Downloading:" & s)
                    If Not File.Exists(dest & "\" & d(s)) Then
                        w.DownloadFile(s, dest & "\" & d(s))
                    End If
                    isDownloaded = True
                Catch exc As Exception
                    If exc.Message.IndexOf("(415)") >= 0 Or _
                       exc.Message.IndexOf("(404)") >= 0 Then
                        isDownloaded = True
                    End If
                End Try
            End While
        Next
    End Sub
End Module
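
For the original question (getting each link as a string and checking it for keywords), the same regex idea can be cut down to a function that just returns the links. This is only a rough sketch, not a full HTML parser: the names LinkTools, ExtractLinks and ContainsKeyword are made up for illustration, the sample HTML in Main is invented, and the unquoted branch of the regex is changed to stop at ">" so that it does not swallow the rest of the tag. Like the code above, it will still pick up hrefs that sit inside HTML comments.

```vbnet
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module LinkTools
    ' Hypothetical helper: returns every distinct href value found in the HTML.
    Public Function ExtractLinks(ByVal html As String) As ArrayList
        ' Quoted value, or an unquoted run that stops at whitespace or ">".
        Dim r As New Regex("href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>[^\s>]+))", _
            RegexOptions.IgnoreCase)
        Dim links As New ArrayList
        For Each m As Match In r.Matches(html)
            Dim url As String = m.Groups(1).Value
            ' skip duplicates
            If Not links.Contains(url) Then
                links.Add(url)
            End If
        Next
        Return links
    End Function

    ' Hypothetical helper: case-insensitive keyword test on one link.
    Public Function ContainsKeyword(ByVal url As String, ByVal keyword As String) As Boolean
        Return url.ToLower().IndexOf(keyword.ToLower()) >= 0
    End Function

    Sub Main()
        ' Invented sample input; in practice this would be the downloaded page text.
        Dim html As String = "<a href=""/downloads/slides.ppt"">Slides</a> " & _
            "<a href=notes.html>Notes</a>"
        For Each link As String In ExtractLinks(html)
            If ContainsKeyword(link, "slides") Then
                Console.WriteLine(link)
            End If
        Next
    End Sub
End Module
```

With the sample string above, only /downloads/slides.ppt contains the keyword, so that is the one link written to the console. To use the two functions alongside SiteSweep, drop this Main and pass in the page text read from the StreamReader.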

Scott Swigart

