Need help with a Regular Expression


Luhar

After much scouring of information on Regular Expressions from books and the
web, I've come up with this handy little Regex to parse links from HTML:

<a\s+href(?:\s+)?=(?:\s+)?[""']+(.?[^'""]+)['""]+(?:\s+)?>(.*?)</a>

It works quite well at extracting the url and title of a link from an anchor
tag, with one major problem--if the anchor tag includes other attributes
after the HREF= attribute, such as TITLE= or TARGET=, it doesn't consider it
a match. Here are some examples:

This one matches:
<a href="/">Home</a>
Group 1: "/"
Group 2: "Home"

This one doesn't:
<a href="/" target="_blank">Home</a>

I can't figure out how to match just the href attribute and just the link
text. Any help would be appreciated.

Thanks.
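
One possible way to let the pattern tolerate trailing attributes is to allow any run of non-`>` characters between the closing quote and the `>`. The sketch below is in Python rather than a .NET language, so the doubled quotes (`""`) are written as a plain `"`. The names `ORIGINAL` and `RELAXED` are made up for illustration, and Python's `re` module is assumed to behave like .NET for these constructs.

```python
import re

# The pattern from the post, with the VB-style doubled quotes undone.
ORIGINAL = re.compile(
    r'''<a\s+href(?:\s+)?=(?:\s+)?["']+(.?[^'"]+)['"]+(?:\s+)?>(.*?)</a>''')

# Hypothetical variant: [^>]* skips any attributes that follow href.
RELAXED = re.compile(
    r'''<a\s+href\s*=\s*["']([^"']*)["'][^>]*>(.*?)</a>''',
    re.IGNORECASE | re.DOTALL)

plain = '<a href="/">Home</a>'
extra = '<a href="/" target="_blank">Home</a>'

print(ORIGINAL.search(plain).groups())   # url and link text
print(ORIGINAL.search(extra))            # no match: target= gets in the way
print(RELAXED.search(extra).groups())    # url and link text again
```

Note that `RELAXED` still expects href to be the first attribute of the tag.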
 
Hi,


Maybe this will help.


' Download the page to scan.
Dim wc As New System.Net.WebClient
Dim sr As New System.IO.StreamReader(wc.OpenRead("http://news.google.com/"))
Dim strHtml As String

' Any double-quoted value; the named group "url" holds the text without the quotes.
Dim regLink As New System.Text.RegularExpressions.Regex("\""(?<url>[^\""]*)\""")
' The text between ">" and the next "<", i.e. the link text.
Dim regTitle As New System.Text.RegularExpressions.Regex(">(.*?)\<")
' A complete anchor element that starts with <a href="...
Dim regHref As New System.Text.RegularExpressions.Regex("\<a href=""(.*?)""\>(.*?)\<\/a\>")
Dim m As System.Text.RegularExpressions.Match

Try
    strHtml = sr.ReadToEnd

    For Each m In regHref.Matches(strHtml)
        Dim mLink As System.Text.RegularExpressions.Match

        ' Prints every quoted value inside the matched tag, so extra
        ' attributes such as target="_blank" are listed as well.
        For Each mLink In regLink.Matches(m.ToString())
            Trace.WriteLine(String.Format("Link {0}", mLink.Groups("url").Value))
        Next

        For Each mLink In regTitle.Matches(m.ToString())
            ' Group 1 is the text without the surrounding ">" and "<".
            Trace.WriteLine(String.Format("Title {0}", mLink.Groups(1).Value))
        Next
    Next
Catch ex As Exception
    ' Report the failure instead of silently swallowing it.
    Trace.WriteLine(ex.Message)
Finally
    sr.Close()
    wc.Dispose()
End Try



Ken

 
Thanks Ken. I'll try that.

While I was waiting for somebody to reply, I refined my original regex a bit
and it seems to find the url and text of href tags, no matter how the tag is
formatted. Here it is:


<\s*a\s*.*[^href]?href\s*={1}?[\s""']*([^\s'"">]*)[^>]*?>{1}(.*?\n*.*?)</a>+?

Looks ugly, but it seems to do the job. The whole match (group 0) is the entire
tag from the opening <a... to the closing ...</a>. The first capture group
is just the url, including the protocol, host, destination, fragments, and
queries if they exist. The second capture group is the text between the <a>
and </a> tags, including any embedded html tags.
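
To check which group holds what, here is a quick sketch in Python (not the thread's VB.NET; the doubled quotes are written as a single `"`). The name `REFINED` is made up for illustration. In both Python and .NET the whole match is group 0, and the pattern itself contains two parenthesised groups.

```python
import re

# The refined pattern from the post, with "" written as a single ".
REFINED = re.compile(
    r'''<\s*a\s*.*[^href]?href\s*={1}?[\s"']*([^\s'">]*)[^>]*?>{1}(.*?\n*.*?)</a>+?''')

m = REFINED.search('<a href="/" target="_blank">Home</a>')
print(REFINED.groups)   # number of parenthesised groups in the pattern
print(m.group(0))       # whole match: the entire anchor element
print(m.group(1))       # the url
print(m.group(2))       # the link text
```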

If anybody can think of a situation where this regex doesn't match an anchor
tag in html code, please let me know.

Thanks again,

Luhar
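
One case that appears to trip the refined pattern is two anchors on the same line: the greedy `.*` runs ahead to the last href on the line, so the earlier link is swallowed. Also, `[^href]` is a character class (any single character other than h, r, e or f), not "anything but the word href". The sketch below is in Python with the doubled quotes undone; `REFINED` and `PER_TAG` are hypothetical names, and `PER_TAG` is only one possible alternative, assuming Python's `re` behaves like .NET for these constructs.

```python
import re

# The refined pattern from the post, with "" written as a single ".
REFINED = re.compile(
    r'''<\s*a\s*.*[^href]?href\s*={1}?[\s"']*([^\s'">]*)[^>]*?>{1}(.*?\n*.*?)</a>+?''')

# Hypothetical alternative: stay inside one tag with [^>]*? instead of .*
PER_TAG = re.compile(
    r'''<a\s[^>]*?href\s*=\s*["']?([^\s"'>]*)[^>]*>(.*?)</a>''',
    re.IGNORECASE | re.DOTALL)

line = '<a href="/a">A</a> <a href="/b">B</a>'

print(REFINED.findall(line))   # a single result: only the last link survives
print(PER_TAG.findall(line))   # one (url, text) pair per anchor
```

Nested or malformed markup can still defeat either pattern, so treat both as sketches.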
 
