Parsing HTML with C#

Paul E Collins

I want to extract a few simple elements from an HTML document
(Transitional 4.0, so not necessarily something that can be handled by
an XML parser). I've got as far as adding a reference to
Microsoft.mshtml, but now I don't know what's what.

The IHTMLDocument interfaces seem to be potentially helpful, but what
class implements those interfaces (i.e. how can I create an instance
with a "new" expression)? Basically, how do I get from a text file to
a model of the HTML contents in memory?

Eq.
 
Sericinus hunter

In .NET 1.1, the object returned by AxSHDocVw.AxWebBrowser.Document
implements those interfaces. So, here is how it can be done:

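// Needs references to AxSHDocVw and Microsoft.mshtml (using mshtml;).
// An ActiveX control normally has to be placed on a form before use.
AxSHDocVw.AxWebBrowser wb = new AxSHDocVw.AxWebBrowser();
object oNone = Type.Missing;
// Load a blank page so that the control has a document to work with.
wb.Navigate("about:blank", ref oNone, ref oNone, ref oNone, ref oNone);
while (wb.Busy)
{
    // Keep processing window messages until navigation has finished.
    Application.DoEvents();
}
string htmlString = "<html><body>Hello</body></html>";
IHTMLDocument2 doc = wb.Document as IHTMLDocument2;
// Replace the contents of the blank document with our own HTML.
doc.clear();
doc.open(null, null, null, null);
doc.write(htmlString);
doc.close();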

Now you can access whatever the interface provides.
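
For instance, here is a rough sketch of reading the title and the links from such a document. The class and method names (DocumentDump, Dump) are made up for illustration; the code assumes a reference to Microsoft.mshtml and a document that has already been loaded as shown above.

```csharp
using System;
using mshtml; // from the Microsoft.mshtml interop assembly

class DocumentDump
{
    // Hypothetical helper: prints the page title and every link in the document.
    public static void Dump(IHTMLDocument2 doc)
    {
        Console.WriteLine("Title: " + doc.title);

        // doc.links is the collection of link elements (<a href>, <area href>).
        foreach (IHTMLElement link in doc.links)
        {
            // The second argument is a flags value; 0 asks for a
            // case-insensitive attribute lookup.
            object href = link.getAttribute("href", 0);
            Console.WriteLine(link.innerText + " -> " + href);
        }
    }
}
```

Other collections such as doc.all and doc.images can be walked the same way.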
 
Guest

Paul,
I think the best answer to this would revolve around the question "Where is
the HTML coming from?" If you want something that will give you the
flexibility of treating even "poorly formed" HTML as XML, I'd recommend
Simon Mourier's HTMLAgilityPack.
Peter
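
As a minimal sketch of that approach, the snippet below parses deliberately sloppy HTML and collects the link targets with an XPath query. It assumes the HtmlAgilityPack assembly is referenced; the class and method names (LinkLister, ExtractLinks) are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party library, not part of the .NET Framework

class LinkLister
{
    // Hypothetical helper: returns the href value of every <a> element in the markup.
    public static List<string> ExtractLinks(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // doc.Load(path) reads from a file instead

        List<string> hrefs = new List<string>();

        // SelectNodes takes an XPath expression and returns null when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                hrefs.Add(link.GetAttributeValue("href", ""));
            }
        }
        return hrefs;
    }

    static void Main()
    {
        // Unclosed <p> and <li> tags, as often found in HTML 4.0 Transitional pages.
        string html = "<html><body><p>First<p>Second<ul>" +
                      "<li><a href=\"a.html\">A</a><li><a href=\"b.html\">B</a>" +
                      "</ul></body></html>";

        foreach (string href in ExtractLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Because the parser tolerates unclosed tags, no clean-up pass is needed before querying the document.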
 
