A year ago I wrote a spider that crawls around
http://www.wtng.info and
extracts all the country codes, country names, and telephone numbering
information. I haven't looked at it since then, but I believe the stripHTML
function at the end of this answer is the utility I wrote to strip all HTML
tags from a string, leaving only the text.
Use it like this:
' ...get your data first
Dim htmlText As String = "" ' replace "" with wherever you got your HTML text
' This will keep extra empty lines
Dim textWithBreaks As String = stripHTML(htmlText, True)
' This will remove [most] extra empty lines
Dim textCompact As String = stripHTML(htmlText, False)
For this function to compile, add Imports System.Text.RegularExpressions
at the top of the file.
Private Function stripHTML( _
ByVal html As String, _
ByVal keepCRLF As Boolean _
) As String
' Both patterns match a complete HTML tag, from "<" to the next ">".
' rxDrop also swallows any CRLF pairs that directly follow the tag.
Dim rxDrop As New Regex("<[^>]+>(\r\n)*")
Dim rxKeep As New Regex("<[^>]+>")
If keepCRLF Then
Return rxKeep.Replace(html, "").Trim()
Else
Return rxDrop.Replace(html, "").Trim()
End If
End Function
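If it helps to see the difference between the two modes, here is a rough console sketch. The module name StripHtmlDemo and the sample markup are made up for illustration, and the function is a condensed copy of the one above so the snippet compiles on its own; I have not checked it against the original spider.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module StripHtmlDemo
    ' Condensed copy of stripHTML from this answer (same two patterns).
    Function StripHTML(ByVal html As String, ByVal keepCRLF As Boolean) As String
        Dim pattern As String = If(keepCRLF, "<[^>]+>", "<[^>]+>(\r\n)*")
        Return Regex.Replace(html, pattern, "").Trim()
    End Function

    Sub Main()
        ' Hypothetical input: two paragraphs separated by a single CRLF.
        Dim sample As String = "<p>Hello</p>" & vbCrLf & "<p>World</p>"

        ' keepCRLF = True: only the tags go, the line break survives.
        Console.WriteLine(StripHTML(sample, True))

        ' keepCRLF = False: the CRLF after </p> goes too.
        Console.WriteLine(StripHTML(sample, False)) ' prints "HelloWorld"
    End Sub
End Module
```

Note that with keepCRLF set to False, text from adjacent lines is joined with no separator ("HelloWorld"), so that mode suits cases where each value is pulled out on its own. Neither pattern decodes entities such as &amp;amp;, and tags containing a ">" inside an attribute value will be cut short.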