How to perform XPath queries on HTML?

  • Thread starter Siegfried Heintze

Siegfried Heintze

JTidy is a Java library that will populate an XML DOM from an HTML string.
The XML DOM supports XPath. Is there a similar library for C# and VB.NET
programmers that will allow me to perform XPath queries on HTML?

Also, what is the name of the HTTP client that will allow me to fetch the
HTML from a web site?

thanks,
Siegfried
 
Scott M.

You can only perform XPath operations on XML, so, by definition, it can't be
used with HTML. But if you have XHTML, you can simply load it into an
XmlDocument and use XPath on that.
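
A minimal sketch of that approach in C#, assuming the DOM class meant is .NET's System.Xml.XmlDocument. The class and method names here are made up for illustration. Note that real XHTML places its elements in the XHTML namespace, so the XPath needs a bound prefix:

```csharp
using System;
using System.Xml;

static class XhtmlXPathSketch
{
    // Returns the href of every <a> element in a well-formed XHTML string.
    public static string[] LinkTargets(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // XHTML elements live in this namespace; "x" is an arbitrary prefix.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        XmlNodeList nodes = doc.SelectNodes("//x:a/@href", ns);
        var result = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
            result[i] = nodes[i].Value;
        return result;
    }
}
```

Calling XhtmlXPathSketch.LinkTargets on a small XHTML string containing one link should return a one-element array holding that link's href. Ordinary tag-soup HTML will make LoadXml throw instead.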

-Scott
 
Barry Kelly

Siegfried said:
JTidy is a Java library that will populate an XML DOM from an HTML string.
The XML DOM supports XPath. Is there a similar library for C# and VB.NET
programmers that will allow me to perform XPath queries on HTML?

Use HtmlAgilityPack. It has a basic, lenient HTML parser and implements
IXPathNavigable and a basic DOM, so it can be searched using an XPath.

http://www.codeplex.com/htmlagilitypack
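
For illustration, a rough sketch of querying tag-soup HTML with it, assuming the HtmlAgilityPack assembly is referenced in the project. The class and method names HapXPathSketch and LinkTargets are hypothetical, not part of the library:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party library, must be referenced separately

static class HapXPathSketch
{
    // Collects the href of every <a> element that has one.
    public static List<string> LinkTargets(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // lenient: does not require well-formed markup

        var result = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null) return result;

        foreach (HtmlNode a in nodes)
            result.Add(a.GetAttributeValue("href", ""));
        return result;
    }
}
```

The null check matters in practice: iterating the result of SelectNodes directly throws when the query matches nothing.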

Siegfried said:
Also, what is the name of the HTTP client that will allow me to fetch the
HTML from a web site?

WebRequest & WebResponse should be able to do this for you, no? Do you
have more specific questions about WebRequest.Create / etc?
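
Something along these lines should do as a starting point. It is only a sketch: FetchSketch and FetchHtml are made-up names, and error handling, timeouts and encoding detection are left out.

```csharp
using System.IO;
using System.Net;

static class FetchSketch
{
    // Downloads the body of the given URL and returns it as a string.
    public static string FetchHtml(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            // StreamReader assumes UTF-8 by default; a page in another
            // encoding would need the charset from the Content-Type header.
            return reader.ReadToEnd();
        }
    }
}
```

The returned string can then be handed to whichever HTML parser you settle on, e.g. string html = FetchSketch.FetchHtml("http://example.com/");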

-- Barry
 
Barry Kelly

Scott said:
You can only perform XPath operations on XML, so, by definition, it can't be
used with HTML.

A handy thing about the XPathNavigator class in .NET is that if you can
implement it (i.e. derive and implement its abstract methods) for your
arbitrary tree-shaped data structure, then you can query it using XPath.

-- Barry
 
Scott M.

But, since HTML may not be a well-formed tree structure, wouldn't you have
problems querying it?
 
Barry Kelly

Scott said:
But, since HTML may not be a well-formed tree structure, wouldn't you have
problems querying it?

Like I said earlier, HtmlAgilityPack uses a very lenient but
deterministic HTML parser. It can make a tree out of just about any
source HTML. As long as the XPath query works on one instance of the
server's generated HTML (assuming it is generated; otherwise, why
automate the querying?), it should work on subsequent instances.

In other words, even if the HTML is malformed and results in a
non-compliant tree, the formation of the tree itself is deterministic
and so it ought to be consistently queryable.

-- Barry
 
