Parsing HTML

  • Thread starter: Yosh

Yosh

I have a web page that I call and I need to get the body text out of the HTML.

<html>
<body>
Hi.
How are you?
</body>
</html>

What is the best way to do this in C# and .NET?

Thanks,

Yosh

Steve Walker
 
Yosh said:
I have a web page that I call and I need to get the body text out of
the HTML.
 
<html>
<body>
Hi.
How are you?
</body>
</html>
 
What is the best way to do this in C# and .NET?

#1 Treat it as a string and parse it using regular expressions.

#2 Use the Microsoft HTML Object Library (mshtml, add reference from COM
tab) to load and parse it, and access it through the document object
model:

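using System;
using mshtml; // COM reference: Microsoft HTML Object Library

namespace HTMParse
{
    /// <summary>
    /// Loads an HTML string into an MSHTML document and prints its body.
    /// </summary>
    class Class1
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread] // MSHTML is an apartment-threaded COM component
        static void Main(string[] args)
        {
            string s = "<html><body>Hi.How are you?</body></html>";

            // Write the markup into an in-memory document and close it
            // so that the DOM is built.
            IHTMLDocument2 doc = new HTMLDocumentClass();
            doc.write(new object[] { s });
            doc.close();

            // innerHTML returns the markup inside <body>; use
            // doc.body.innerText to get the text without tags.
            Console.Write(doc.body.innerHTML);
            Console.Read(); // keep the console window open
        }
    }
}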
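
For comparison, here is a rough sketch of option #1, the regular-expression route. The class and method names (BodyTextExtractor, ExtractBodyText) are made up for illustration and are not part of any library. It assumes the page has a single, well-formed <body> element; regular expressions tend to break on real-world HTML (scripts, comments, unclosed tags), so treat this as a quick hack for simple pages rather than a general parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class BodyTextExtractor
{
    // Captures everything between <body ...> and </body>.
    // Singleline lets "." match newlines; IgnoreCase accepts <BODY>.
    private static readonly Regex BodyPattern = new Regex(
        @"<body[^>]*>(.*?)</body>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag, such as <p> or <br/>.
    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");

    // Returns the text inside <body>, or an empty string if no body is found.
    public static string ExtractBodyText(string html)
    {
        Match match = BodyPattern.Match(html);
        if (!match.Success)
        {
            return string.Empty;
        }

        string inner = match.Groups[1].Value;

        // Drop nested tags, then turn entities such as &amp; back into characters.
        string withoutTags = TagPattern.Replace(inner, string.Empty);
        return WebUtility.HtmlDecode(withoutTags).Trim();
    }
}
```

Calling BodyTextExtractor.ExtractBodyText on the page from the question should give back the two lines "Hi." and "How are you?" with the surrounding whitespace trimmed.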
 
Or, there is a nice parser on Code Project
(http://www.codeproject.com/dotnet/apmilhtml.asp)

BTW: Nice one Steve! I like the IHTMLDocument2 idea. I had not thought of
this one.

Frisky
