Using MSHTML for parsing HTML files in C#

philipl · Sep 23, 2003

hi,

Does anyone have any sample code for this?? I can't find anything
relevant at all. Please share some code if you have any.

thx

Pete · Sep 23, 2003

Hi,

philipl said:
Does anyone have any sample code for this?? I can't find anything
relevant at all. Please share some code if you have any.

A simple google search yields quite a few good results. I suggest you learn
to improve your searching technique.

Or did you want someone to do it for you?

-- Pete

philipl · Sep 23, 2003

Pete said:
Hi,

A simple google search yields quite a few good results. I suggest you learn
to improve your searching technique.

Or did you want someone to do it for you?

-- Pete

ehh... did you look at those results??? If you have found anything
useful, pass it on, as I certainly couldn't find anything more than 5
lines of code.
Otherwise quit spreading negative vibes.

Hasani · Sep 23, 2003

I did a small project using this a few weeks ago. My code will probably look
obscure, but I can try to help you if you have more questions.
http://www.skidmore.edu/~h_blackw/mshtmlsample.cs is the stripped version of
my code that uses the mshtml library.

Sunny · Sep 23, 2003

philipl said:
ehh... did you look at those results??? If you have found anything
useful, pass it on, as I certainly couldn't find anything more than 5
lines of code.
Otherwise quit spreading negative vibes.

Hi, here is a little example. I used this code to read an HTML page,
replace some of the links in it, and then save the result. The example
is not complete, but it shows how to manipulate an HTML page.

Hope that helps
Sunny

<snip>
// mr, source, sUrl, sWebPage and SubstituteTags come from the snipped code
try
{
    mr = new StreamReader(source.OpenRead(sUrl));
    sWebPage = mr.ReadToEnd();
}
catch
{
    // could not read the URL
    return;
}
finally
{
    if (mr != null)
        mr.Close();
}

HTMLDocumentClass myDoc;

try
{
    // place the HTML string in an MSHTML document
    object[] oPageText = {sWebPage};
    myDoc = new HTMLDocumentClass();
    IHTMLDocument2 oMyDoc = (IHTMLDocument2)myDoc;
    oMyDoc.write(oPageText);
}
catch
{
    // page is not well formatted, skip it
    return;
}

// if we are here, we have read the page and we are ready to parse it

// get collection of links
IHTMLElementCollection cMyLinks = (IHTMLElementCollection)myDoc.links;

// modify the links
foreach (IHTMLAnchorElement oLink in cMyLinks)
    oLink.href = SubstituteTags(true, sUrl, oLink.href);

// get collection of images
cMyLinks = (IHTMLElementCollection)myDoc.images;

// modify the images
foreach (IHTMLImgElement oImage in cMyLinks)
    oImage.src = SubstituteTags(false, sUrl, oImage.href);

// write the result
StreamWriter myFile = null;
sWebPage = myDoc.documentElement.outerHTML;
try
{
    myFile = new StreamWriter("modpage.html", false);
    myFile.Write(sWebPage);
}
catch {}
finally
{
    if (myFile != null)
        myFile.Close();
}

<snip>

Pete · Sep 24, 2003

Hi,

philipl said:
ehh... did you look at those results??? If you have found anything
useful, pass it on, as I certainly couldn't find anything more than 5
lines of code.
Otherwise quit spreading negative vibes.

*sigh*. Okay. Depending on whether or not you're looking to display the
html, these might be of some use:

http://www.itwriting.com/htmleditor/index.php

http://msdn.microsoft.com/library/d...bbrowser/tutorials/webocstream.asp?frame=true

http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/hosting/hosting.asp

http://msdn.microsoft.com/library/d...ml/vsgrfwalkthroughaccessingdhtmldomfromc.asp

http://www.devhood.com/tutorials/tutorial_details.aspx?tutorial_id=312

http://www.thecodeproject.com/csharp/webbrowser.asp

http://www.codeproject.com/csharp/advhost.asp

http://blog.monstuff.com/archives/000052.html

If you just want to use mshtml without a ui, this might help:

http://www.codeguru.com/ieprogram/HTMLParsing.html

I know that last one isn't c#, but it should show what you need (especially
combined with those others). None of this was hard to find and I'm certain
there's a lot more out there that my brief search didn't pick up.

-- Pete

philipl · Sep 24, 2003

Hasani said:
I did a small project using this a few weeks ago. My code will probably look
obscure, but I can try to help you if you have more questions.
http://www.skidmore.edu/~h_blackw/mshtmlsample.cs is the stripped version of
my code that uses the mshtml library.

Thx for the code! I have tried out both implementations but I still
can't access my HTML page. The problem, I think, is that
HTMLDocumentClass does not seem to enumerate my HTML page properly. It
picks up the file size etc., but <title> and <body> do not seem to
be enumerated. Can you spot what I may be doing wrong?

thx

This is the simple HTML page I am trying to read:
<HTML>
<HEAD>
<TITLE>I Love HTML</TITLE>
</HEAD>

<BODY>
Everything displayed on your page will be in here.
</BODY>

</HTML>

Here is the code:

static void Main()
{
    // It seems that HTMLDocumentClass in LoadHtml does not enumerate the
    // file properly, so I think the problem starts here.
    HTMLDocument htmlDoc =
        LoadHtml(@"D:\work\htmlparse\ConsoleApplication\hello.html");

    IHTMLElementCollection title = htmlDoc.getElementsByTagName("title");

    // prints nothing
    Console.WriteLine(htmlDoc.title);

    // no elements
    foreach (IHTMLTitleElement myt in title)
    {
        Console.WriteLine(myt.ToString());
    }
}

private static HTMLDocument LoadHtml(string path)
{
    HTMLDocumentClass dom = new HTMLDocumentClass();
    System.Runtime.InteropServices.UCOMIPersistFile pf =
        (System.Runtime.InteropServices.UCOMIPersistFile)dom;
    pf.Load(path, 1);

    return dom;
}

Sunny · Sep 25, 2003

I posted yesterday, but it seems my post did not appear. The
following example works just fine.

Hope that helps
Sunny

string myPage = "<HTML><HEAD><TITLE>I Love HTML</TITLE></HEAD>" +
    "<BODY>Everything displayed on your page will be in here.</BODY>" +
    "</HTML>";

HTMLDocumentClass myDoc;

// load the document from the string
object[] oPageText = {myPage};
myDoc = new HTMLDocumentClass();
IHTMLDocument2 oMyDoc = (IHTMLDocument2)myDoc;
oMyDoc.write(oPageText);

IHTMLElementCollection title = myDoc.getElementsByTagName("title");
foreach (IHTMLTitleElement myt in title)
{
    Console.WriteLine(myt.text);
}

Console.WriteLine(myDoc.title);

philipl · Oct 6, 2003

Thanks for the links and code, guys.
-Sunny, thanks for the code. I was able to get what I needed with this as
a start. Cheers.
