Need help with a Regular Expression

Luhar

After much scouring of information on Regular Expressions from books and the
web, I've come up with this handy little Regex to parse links from HTML:

<a\s+href(?:\s+)?=(?:\s+)?[""']+(.?[^'""]+)['""]+(?:\s+)?>(.*?)</a>

It works quite well at extracting the url and title of a link from an anchor
tag, with one major problem--if the anchor tag includes other attributes
after the HREF= attribute, such as TITLE= or TARGET=, it doesn't consider it
a match. Here are some examples:

This one matches:
<a href="/">Home</a>
Group 1: "/"
Group 2: "Home"

This one doesn't:
<a href="/" target="_blank">Home</a>

I can't figure out how to match just the href attribute and just the link
text. Any help would be appreciated.

Thanks.
 
Ken Tucker [MVP]

Hi,


Maybe this will help.


' Download the page to scan for links.
Dim wc As New System.Net.WebClient
Dim sr As New System.IO.StreamReader(wc.OpenRead("http://news.google.com/"))
Dim strHtml As String

' regHref finds each anchor tag that starts with <a href="...
' regLink finds every double-quoted value inside a matched tag.
' regTitle finds the text that sits between a > and the next <.
Dim regLink As New System.Text.RegularExpressions.Regex("\""(?<url>[^\""]*)\""")
Dim regTitle As New System.Text.RegularExpressions.Regex(">(.*?)\<")
Dim regHref As New System.Text.RegularExpressions.Regex("\<a href=""(.*?)""\>(.*?)\<\/a\>")
Dim m As System.Text.RegularExpressions.Match

Try
    strHtml = sr.ReadToEnd

    For Each m In regHref.Matches(strHtml)
        Dim mLink As System.Text.RegularExpressions.Match

        ' Each match still includes its surrounding quotes.
        For Each mLink In regLink.Matches(m.ToString())
            Trace.WriteLine(String.Format("Link {0}", mLink.ToString))
        Next

        For Each mLink In regTitle.Matches(m.ToString())
            ' Strip the > and < that regTitle keeps in the match.
            Dim strTitle As String = mLink.ToString
            strTitle = strTitle.Replace(">", "")
            strTitle = strTitle.Replace("<", "")
            Trace.WriteLine(String.Format("Title {0}", strTitle))
        Next
    Next
Catch ex As Exception
    ' Report the failure instead of silently swallowing it.
    Trace.WriteLine(ex.ToString())
Finally
    sr.Close()
    wc.Dispose()
End Try
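
For the case in the question (extra attributes after href), a single pattern may also do. This is only a minimal sketch, not part of the code above: the pattern, the module name HrefSketch and the sample tag are illustrative assumptions, and it only handles href values wrapped in single or double quotes.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module HrefSketch
    Sub Main()
        ' Hypothetical pattern: [^>]*? skips any attributes before href and
        ' [^>]* skips any that follow it, so TARGET= or TITLE= no longer
        ' prevent a match. "" inside a VB string literal is one quote.
        Dim pattern As String = "<a\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""'][^>]*>(.*?)</a>"

        ' IgnoreCase accepts <A HREF=...>; Singleline lets the link text span lines.
        Dim opts As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

        ' Sample input taken from the question.
        Dim html As String = "<a href=""/"" target=""_blank"">Home</a>"

        For Each m As Match In Regex.Matches(html, pattern, opts)
            ' Groups(1) is the URL, Groups(2) is the link text.
            Console.WriteLine("url={0} text={1}", m.Groups(1).Value, m.Groups(2).Value)
        Next
    End Sub
End Module
```

With the sample tag this prints url=/ text=Home.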



Ken

 
Luhar

Thanks Ken. I'll try that.

While I was waiting for somebody to reply, I refined my original regex a bit
and it seems to find the url and text of href tags, no matter how the tag is
formatted. Here it is:


<\s*a\s*.*[^href]?href\s*={1}?[\s""']*([^\s'"">]*)[^>]*?>{1}(.*?\n*.*?)</a>+?

Looks ugly, but it seems to do the job. The whole match (group 0) is the
entire tag from the opening <a... to the closing ...</a>. Capture group 1
is just the url, including the protocol, host, destination, fragments, and
queries if they exist. Capture group 2 is the text between the <a> and
</a> tags, including any embedded html tags.
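
For reference, here is roughly how those pieces can be read back in .NET. It is a minimal sketch: the module name GroupDemo and the sample tag are assumptions, and the pattern is the one above pasted into a VB string literal (where "" stands for one quote character). In .NET, Groups(0) is always the whole match and the parenthesised captures are numbered from 1, so a tool that lists the whole match first would show three entries.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupDemo
    Sub Main()
        ' The refined pattern from this post, unchanged.
        Dim pattern As String = "<\s*a\s*.*[^href]?href\s*={1}?[\s""']*([^\s'"">]*)[^>]*?>{1}(.*?\n*.*?)</a>+?"

        ' Sample tag with an extra attribute after href.
        Dim html As String = "<a href=""/"" target=""_blank"">Home</a>"

        Dim m As Match = Regex.Match(html, pattern)
        If m.Success Then
            Console.WriteLine("Groups(0) = {0}", m.Groups(0).Value) ' whole tag
            Console.WriteLine("Groups(1) = {0}", m.Groups(1).Value) ' url
            Console.WriteLine("Groups(2) = {0}", m.Groups(2).Value) ' link text
        End If
    End Sub
End Module
```

For this sample, Groups(0) is the full tag, Groups(1) is / and Groups(2) is Home.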

If anybody can think of a situation where this regex doesn't match an anchor
tag in html code, please let me know.
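
One situation that does seem to trip it up is two anchors on the same line: the greedy .* after <\s*a\s* runs ahead to the last href on that line, so the first link is swallowed into a single match. A minimal sketch of that case, with a made-up module name and sample input:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SameLineCase
    Sub Main()
        ' The refined pattern from this post, unchanged.
        Dim pattern As String = "<\s*a\s*.*[^href]?href\s*={1}?[\s""']*([^\s'"">]*)[^>]*?>{1}(.*?\n*.*?)</a>+?"

        ' Two links on one line (hypothetical input).
        Dim html As String = "<a href=""/one"">One</a> <a href=""/two"">Two</a>"

        Dim found As MatchCollection = Regex.Matches(html, pattern)
        Console.WriteLine("matches: {0}", found.Count)

        For Each m As Match In found
            Console.WriteLine("url={0} text={1}", m.Groups(1).Value, m.Groups(2).Value)
        Next
    End Sub
End Module
```

This reports one match with url=/two and text=Two, so /one is never returned. Separately, unless RegexOptions.IgnoreCase is passed, tags written as <A HREF=...> will not match either.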

Thanks again,

Luhar
 
