How to perform XPath queries on HTML?

Siegfried Heintze · Dec 2, 2007

JTidy is a Java library that will populate an XML DOM from an HTML string.
The XML DOM supports XPath. Is there a similar library for C# and VB.NET
programmers that will allow me to perform XPath queries on HTML?

Also, what is the name of the HTTP client that will allow me to fetch the
HTML from a web site?

thanks,
Siegfried

Scott M. · Dec 2, 2007

You can only perform XPath operations on XML, so, by definition, it can't be
used with HTML. But, if you have XHTML, then you can simply load it into an
XmlDocument and use XPath on that.
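
A minimal sketch of that approach, assuming the markup really is well-formed XHTML with no DOCTYPE to resolve. The sample string and class name are made up for illustration. Since XHTML elements live in the http://www.w3.org/1999/xhtml namespace, the query has to use a prefix registered through XmlNamespaceManager:

```csharp
using System;
using System.Xml;

class XhtmlXPathSketch
{
    static void Main()
    {
        // Hypothetical well-formed XHTML; real pages are often not this clean.
        string xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<a href=\"/one\">One</a><a href=\"/two\">Two</a>" +
            "</body></html>";

        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // XPath 1.0 has no default namespace, so bind a prefix ("x" is arbitrary).
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        // Select the href attribute of every anchor element.
        foreach (XmlNode href in doc.SelectNodes("//x:a/@href", ns))
        {
            Console.WriteLine(href.Value);
        }
    }
}
```

If the input is ordinary tag-soup HTML, LoadXml will most likely throw, which is the limitation described above.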

-Scott

Barry Kelly · Dec 3, 2007

Siegfried said:
JTidy is a Java library that will populate an XML DOM from an HTML string.
The XML DOM supports XPath. Is there a similar library for C# and VB.NET
programmers that will allow me to perform XPath queries on HTML?

Use HtmlAgilityPack. It has a basic, lenient HTML parser and implements
IXPathNavigable and a basic DOM, so it can be searched using XPath.

http://www.codeplex.com/htmlagilitypack
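
As a rough sketch of what that looks like, assuming the HtmlAgilityPack assembly is referenced by the project (the sample HTML and class name are invented for illustration):

```csharp
using System;
using HtmlAgilityPack; // third-party library, not part of the .NET Framework

class AgilityPackSketch
{
    static void Main()
    {
        // Deliberately sloppy HTML: unclosed <p> tags, no </body> or </html>.
        string html = "<html><body><p>First<p>Second <a href=\"/two\">link</a>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // lenient: builds a tree instead of throwing

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return;

        foreach (HtmlNode link in links)
        {
            Console.WriteLine(link.GetAttributeValue("href", "") + " -> " + link.InnerText);
        }
    }
}
```

The null check matters in practice: looping over the result of SelectNodes without it fails as soon as a query matches nothing.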

Also, what is the name of the HTTP client that will allow me to fetch the
HTML from a web site?

WebRequest & WebResponse should be able to do this for you, no? Do you
have more specific questions about WebRequest.Create / etc?
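
For reference, a bare-bones fetch might look something like the following. This is only a sketch: the URL is a placeholder, the FetchHtml helper is a name made up here, and there is no error handling or charset detection.

```csharp
using System;
using System.IO;
using System.Net;

class FetchSketch
{
    // Hypothetical helper: downloads the body of a URL as a string.
    static string FetchHtml(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            // StreamReader defaults to UTF-8; a real client should honor
            // the charset declared by the response.
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // example.com is a placeholder, not a site from the question.
        Console.WriteLine(FetchHtml("http://example.com/"));
    }
}
```

WebClient.DownloadString does much the same in one call if you don't need control over the request.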

-- Barry

Barry Kelly · Dec 3, 2007

Scott said:
You can only perform XPath operations on XML, so, by definition, it can't be
used with HTML.

A handy thing about the XPathNavigator class in .NET is that if you can
implement it (i.e. derive and implement its abstract methods) for your
arbitrary tree-shaped data structure, then you can query it using XPath.

-- Barry

Scott M. · Dec 4, 2007

But, since HTML may not be a well-formed tree structure, wouldn't you have
problems querying it?

Barry Kelly · Dec 5, 2007

Scott said:
But, since HTML may not be a well-formed tree structure, wouldn't you have
problems querying it?

Like I said earlier, HtmlAgilityPack uses a very lenient but
deterministic HTML parser. It can make a tree out of just about any
source HTML. As long as the XPath query works on one instance of the
server side's generated HTML (assuming it's generated; otherwise, why
automate the querying?), it should work on subsequent instances.

In other words, even if the HTML is malformed and results in a
non-compliant tree, the formation of the tree itself is deterministic
and so it ought to be consistently queryable.

-- Barry
