Regular expression

prithvis.mohanty

I need to extract every href URL, together with its anchor text, from an
HTML page using a regular expression. I have this regex:
href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))
It extracts only the href URL. I need the anchor text along with the URL.
How can I modify the regex above to capture the anchor text as well?
 
Greg Bacon

: I need to extract all the href urls and the anchor text with regular
: expression match from a html page. I have this
: href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+)) regex with me. This only
: extracts the href link url . With url I need the anchor text as well.
: How can I modify the above regex to get anchor text as well.

Although it will likely invite fiery criticism from the anti-regex
Luddites, RFC 2396 (http://www.faqs.org/rfcs/rfc2396.html) gives a
regular expression for decomposing URIs:

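using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // Pattern from RFC 2396, Appendix B. Groups: 2 = scheme,
        // 4 = authority, 5 = path, 7 = query, 9 = fragment.
        Regex uri = new Regex(
            @"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?");

        string[] tests = new string[]
        {
            "http://www.w3.org/",
            "http://www.ics.uci.edu/pub/ietf/uri/#Related",
        };

        foreach (string test in tests)
        {
            Match m = uri.Match(test);

            // Group 9 is the fragment, so the parentheses are empty
            // for a URI that has no fragment.
            if (m.Success)
                Console.WriteLine("match (" + m.Groups[9] + ")");
            else
                Console.WriteLine("no match");
        }
    }
}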
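
The RFC pattern decomposes a URI that has already been isolated. To pull both the href value and the anchor text out of markup, as the original question asks, something along these lines might work. It is only a rough sketch: the names AnchorExtractor, Extract, url and text are made up for illustration, and a regex like this will trip over comments, scripts, nested anchors and attributes such as data-href, so an HTML parser is the safer choice for arbitrary pages.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class AnchorExtractor
{
    // Hypothetical pattern: "url" captures the href value (double-quoted,
    // single-quoted, or unquoted); "text" captures the raw markup between
    // the opening <a ...> tag and the next </a>.
    static readonly Regex Anchor = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))[^>]*>(?<text>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns (url, anchor text) pairs in document order.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in Anchor.Matches(html))
        {
            links.Add(new KeyValuePair<string, string>(
                m.Groups["url"].Value,
                m.Groups["text"].Value.Trim()));
        }
        return links;
    }

    static void Main()
    {
        string html =
            "<p><a href=\"http://www.w3.org/\">W3C</a> and " +
            "<a class=x href=page.html>a <b>local</b> page</a></p>";

        foreach (var link in Extract(html))
            Console.WriteLine(link.Key + " -> " + link.Value);
    }
}
```

Note that the text group keeps any tags nested inside the anchor (the <b> element in the second sample link), so a further pass would be needed if only the visible text is wanted.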

Hope this helps,
Greg
 
