A challenge: regex to convert all urls within HTML


Vlad

Hi!

My task: Take HTML -> convert into plain text.

Sub-task:
1. Find all URLs within the HTML (<a href="http://www.abc.com">More
about baby bears</a>).
2. Convert each one into plain text: More about baby bears
(http://www.abc.com)


The question:
Can it be done with a single regex (i.e. in a single pass)? If not,
what would be the most efficient way of doing it?



Thank you very much for your time!
 
I don't claim it would be better or worse - but if the source is
XHTML, an alternative might be XSLT? But it can be hard to write tidy
XSLT that correctly handles mixed content (which is typical in XHTML).

Just a thought...

But yes - I would *imagine* that you can do this with a regex replace,
but handling all permutations of attributes / sequence etc could be a
pain.
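
To make the "permutations" concern concrete, here is a minimal sketch of a parser-based alternative. It assumes Python and its standard html.parser module, neither of which is mentioned in this thread; the names LinkFlattener and html_to_text are made up for illustration. Because the parser normalises tag case, attribute order and quoting, none of those need to be spelled out by hand.

```python
from html.parser import HTMLParser


class LinkFlattener(HTMLParser):
    """Collects text and appends " (url)" after each link's text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # pieces of the plain-text output
        self.href_stack = []   # href of each currently open <a>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased, in any order.
        if tag == "a":
            self.href_stack.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.href_stack:
            href = self.href_stack.pop()
            if href:
                self.parts.append(f" ({href})")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    # Hypothetical helper: strips all tags and keeps link targets.
    parser = LinkFlattener()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Unlike a regex replace, this drops every tag, not just the anchors, which matches the overall task of turning HTML into plain text.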

Marc
 
(?i)(?s)<a\s[^>]*?href="?(?<url>[^"\s>]+)"?[^>]*>(?<innerHtml>.+?)</a\s*>

Group "url" contains the URL.
Group "innerHtml" contains the inner HTML - the text between the tags.
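
As a rough sketch of the single-pass replace the original question asks about, here is a Python adaptation of a pattern of this shape. Python is an assumption here (the pattern above uses the (?<name>...) group syntax, which Python writes as (?P<name>...)), and the function name links_to_text is made up for illustration. It only handles double-quoted or unquoted href values.

```python
import re

# Same idea as the pattern above, in Python syntax.
# IGNORECASE and DOTALL stand in for the inline (?i) and (?s) flags.
LINK = re.compile(
    r'<a\s[^>]*?href="?(?P<url>[^"\s>]+)"?[^>]*>(?P<innerHtml>.+?)</a\s*>',
    re.IGNORECASE | re.DOTALL,
)


def links_to_text(html):
    # One substitution pass: each link becomes "inner text (url)".
    return LINK.sub(r"\g<innerHtml> (\g<url>)", html)


print(links_to_text('<a href="http://www.abc.com">More about baby bears</a>'))
```

Other tags are left in place, so this covers only the link sub-task, not the full HTML-to-text conversion.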

--
HTH,

Kevin Spencer
Microsoft MVP

DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net
 