A challenge: regex to convert all urls within HTML

Vlad

Hi!

My task: Take HTML -> convert into plain text.

Sub-task:
1. Find all URLs within the HTML (<a href="http://www.abc.com">More about baby bears</a>).
2. Convert them into plain text: More about baby bears (http://www.abc.com)


The question:
Can it be done with a single regex (i.e. in a single pass)? If not, what
would be the most efficient way of doing it?



Thank you very much for your time!
 
Marc Gravell

I don't claim it would be better or worse - but if the source is
XHTML, an alternative might be XSLT? But it can be hard to write tidy
XSLT that correctly handles mixed content (which is typical in XHTML).

Just a thought...

But yes - I would *imagine* that you can do this with a regex replace,
but handling all permutations of attributes / sequence etc could be a
pain.

Marc
 
Kevin Spencer

(?i)(?s)<a[^>]+?href="?(?<url>[^"]+)"?>(?<innerHtml>.+?)</a\s*>

Group "url" contains the URL.
Group "innerHtml" contains the inner HTML, i.e. the text between the tags.
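
For illustration, here is a minimal sketch of how the two groups could drive a single-pass replace. It is written in Python only as an assumption, since the thread does not name a language; the .NET-style (?<name>...) groups are rewritten as Python's (?P<name>...), and the names LINK_RE and links_to_text are hypothetical.

```python
import re

# The pattern from this post, with (?<name>...) rewritten as (?P<name>...)
# because Python's re module uses that spelling for named groups.
LINK_RE = re.compile(
    r'(?i)(?s)<a[^>]+?href="?(?P<url>[^"]+)"?>(?P<innerHtml>.+?)</a\s*>'
)


def links_to_text(html: str) -> str:
    """Replace each matched anchor with 'inner text (url)' in one pass."""
    return LINK_RE.sub(
        lambda m: f"{m.group('innerHtml')} ({m.group('url')})",
        html,
    )


sample = 'See <a href="http://www.abc.com">More about baby bears</a> today.'
result = links_to_text(sample)
print(result)
```

Note that an anchor carrying further attributes after href (for example class="y") is not matched by the pattern as written and is left unchanged, so the attribute permutations mentioned earlier in the thread would still need handling.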

--
HTH,

Kevin Spencer
Microsoft MVP

DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net
 
