Parsing HTML

Yosh

I have a web page that I call and I need to get the body text out of the HTML.

<html>
<body>
Hi.
How are you?
</body>
</html>

What is the best way to do this in C# and .NET?

Thanks,

Yosh
 
Steve Walker

Yosh said:
I have a web page that I call and I need to get the body text out of
the HTML.
 
<html>
<body>
Hi.
How are you?
</body>
</html>
 
What is the best way to do this in C# and .NET?

#1 Treat it as a string and parse it using regular expressions.

#2 Use the Microsoft HTML Object Library (mshtml, add reference from COM
tab) to load and parse it, and access it through the document object
model:

using System;
using mshtml; // COM reference: Microsoft HTML Object Library

namespace HTMParse
{
    class Class1
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread]
        static void Main(string[] args)
        {
            string s = "<html><body>Hi.How are you?</body></html>";

            // Load the markup into an in-memory MSHTML document.
            IHTMLDocument2 doc = new HTMLDocumentClass();
            doc.write(new object[]{s});
            doc.close();

            // innerHTML returns the body's markup; use doc.body.innerText
            // to get the text with the tags stripped.
            Console.Write(doc.body.innerHTML);
            Console.Read(); // wait for input so the console window stays open
        }
    }
}
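
For option #1, here is a minimal sketch of the regular-expression approach. The class and method names (BodyTextExtractor, GetBodyText) are made up for illustration and do not come from any library. It assumes reasonably well-formed HTML with a single body element; regular expressions will trip over things like script blocks, comments, or unclosed tags, so treat it as a starting point only.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of .NET or mshtml.
static class BodyTextExtractor
{
    // Captures everything between <body ...> and </body>, across line breaks.
    private static readonly Regex BodyPattern = new Regex(
        @"<body\b[^>]*>(.*?)</body\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag so it can be stripped from the body content.
    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");

    public static string GetBodyText(string html)
    {
        Match match = BodyPattern.Match(html);
        if (!match.Success)
        {
            // No body element found; return an empty string rather than throw.
            return string.Empty;
        }

        // Replace nested tags with a space, then decode entities such as &amp;.
        string withoutTags = TagPattern.Replace(match.Groups[1].Value, " ");
        return WebUtility.HtmlDecode(withoutTags).Trim();
    }
}
```

Calling BodyTextExtractor.GetBodyText with the page source from the original question should give back the two lines of body text with the surrounding tags removed.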
 
Frisky

Or, there is a nice parser on Code Project
(http://www.codeproject.com/dotnet/apmilhtml.asp)

BTW: Nice one, Steve! I like the IHTMLDocument2 idea. I had not thought of
that one.

Frisky

 
