Using MSHTML for parsing HTML files in C#

philipl

hi,

Does anyone have any sample code for this? I can't find anything
relevant at all. Please share some code if you have any.

thx
 
Pete

Hi,

A simple Google search yields quite a few good results. I suggest you
improve your searching technique.

Or did you want someone to do it for you?

-- Pete
 
philipl

ehh... did you look at those results??? If you have found anything
useful, pass it on, as I certainly couldn't find anything more than 5
lines of code.
Otherwise quit spreading negative vibes.
 
Sunny

Hi, here is a little example. I used this code to read an HTML page,
replace some of the links in it, and then save the result. The example
is not complete, but it shows how to manipulate an HTML page.

Hope that helps
Sunny

<snip>
try
{
    mr = new StreamReader(source.OpenRead(sUrl));
    sWebPage = mr.ReadToEnd();
}
catch
{
    // could not read the URL
    return;
}
finally
{
    if (mr != null)
        mr.Close();
}

HTMLDocumentClass myDoc;

try
{
    // place the HTML string in the MSHTML document
    object[] oPageText = { sWebPage };
    myDoc = new HTMLDocumentClass();
    IHTMLDocument2 oMyDoc = (IHTMLDocument2)myDoc;
    oMyDoc.write(oPageText);
    oMyDoc.close(); // end the write so the document is finalized
}
catch
{
    // page is not well formatted, skip it
    return;
}

// if we are here, we have read the page and we are ready to parse it

// get the collection of links
IHTMLElementCollection cMyLinks = (IHTMLElementCollection)myDoc.links;

// modify the links
foreach (IHTMLAnchorElement oLink in cMyLinks)
    oLink.href = SubstituteTags(true, sUrl, oLink.href);

// get the collection of images
cMyLinks = (IHTMLElementCollection)myDoc.images;

// modify the images
foreach (IHTMLImgElement oImage in cMyLinks)
    oImage.src = SubstituteTags(false, sUrl, oImage.href);

// write the result
StreamWriter myFile = null;
sWebPage = myDoc.documentElement.outerHTML;
try
{
    myFile = new StreamWriter("modpage.html", false);
    myFile.Write(sWebPage);
}
catch { }
finally
{
    if (myFile != null)
        myFile.Close();
}

<snip>
 
Pete

Hi,

*sigh*. Okay. Depending on whether or not you're looking to display the
html, these might be of some use:

http://www.itwriting.com/htmleditor/index.php

http://msdn.microsoft.com/library/d...bbrowser/tutorials/webocstream.asp?frame=true

http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/hosting/hosting.asp

http://msdn.microsoft.com/library/d...ml/vsgrfwalkthroughaccessingdhtmldomfromc.asp

http://www.devhood.com/tutorials/tutorial_details.aspx?tutorial_id=312

http://www.thecodeproject.com/csharp/webbrowser.asp

http://www.codeproject.com/csharp/advhost.asp

http://blog.monstuff.com/archives/000052.html

If you just want to use mshtml without a ui, this might help:

http://www.codeguru.com/ieprogram/HTMLParsing.html

I know that last one isn't C#, but it should show what you need (especially
combined with those others). None of this was hard to find and I'm certain
there's a lot more out there that my brief search didn't pick up.

-- Pete
 
philipl

Hasani said:
I did a small project using this a few weeks ago. My code will probably look
obscure but I can try to help u if you need more questions.
http://www.skidmore.edu/~h_blackw/mshtmlsample.cs is the stripped version of
my code that uses the mshtml library.



Thx for the code! I have tried out both implementations but I still
can't access my HTML page. The problem, I think, is that the
HTMLDocumentClass does not seem to enumerate my HTML page properly. It
picks up what size it is etc., but <title> and <body> do not seem to
be enumerated. Can you spot what I may be doing wrong?

thx

This is the simple HTML page I am trying to read:
<HTML>
<HEAD>
<TITLE>I Love HTML</TITLE>
</HEAD>

<BODY>
Everything displayed on your page will be in here.
</BODY>

</HTML>

Here is the code:

main()
{
    // it seems that HTMLDocumentClass in 'LoadHtml' does not enumerate the
    // file properly, so I think the problem starts here
    HTMLDocument htmlDoc =
        LoadHtml(@"D:\work\htmlparse\ConsoleApplication\hello.html");

    IHTMLElementCollection title = htmlDoc.getElementsByTagName("title");

    // nothing
    Console.WriteLine(htmlDoc.title);

    // no elements
    foreach (IHTMLTitleElement myt in title)
    {
        Console.WriteLine(myt.ToString());
    }
}

private static HTMLDocument LoadHtml(string path)
{
    HTMLDocumentClass dom = new HTMLDocumentClass();
    System.Runtime.InteropServices.UCOMIPersistFile pf =
        (System.Runtime.InteropServices.UCOMIPersistFile)dom;
    pf.Load(path, 1);

    return dom;
}
 
Sunny

I posted yesterday, but it seems my post did not appear. The
following example works just fine.

Hope that helps
Sunny

string myPage = "<HTML><HEAD><TITLE>I Love HTML</TITLE></HEAD>" +
    "<BODY>Everything displayed on your page will be in here.</BODY>" +
    "</HTML>";

HTMLDocumentClass myDoc;

// load the document
object[] oPageText = { myPage };
myDoc = new HTMLDocumentClass();
IHTMLDocument2 oMyDoc = (IHTMLDocument2)myDoc;
oMyDoc.write(oPageText);
oMyDoc.close(); // end the write so the document is finalized

IHTMLElementCollection title = myDoc.getElementsByTagName("title");
foreach (IHTMLTitleElement myt in title)
{
    Console.WriteLine(myt.text);
}

Console.WriteLine(myDoc.title);
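
A note on the file-based LoadHtml posted earlier in the thread: a likely reason it shows an empty title is that MSHTML loads a file asynchronously, so the document is read before parsing has finished. The untested sketch below waits for readyState to become "complete" while pumping messages. Everything beyond the thread's own code is an assumption: the class name MshtmlFileLoader, the 10-second timeout, the use of System.Runtime.InteropServices.ComTypes.IPersistFile in place of UCOMIPersistFile, and Application.DoEvents() as the message pump. It needs references to Microsoft.mshtml and System.Windows.Forms.

```csharp
using System;
using System.Runtime.InteropServices.ComTypes;
using System.Windows.Forms;
using mshtml;

// Hypothetical helper class, not part of the original posts.
static class MshtmlFileLoader
{
    public static HTMLDocument LoadHtml(string path)
    {
        HTMLDocumentClass dom = new HTMLDocumentClass();
        IPersistFile pf = (IPersistFile)dom;
        pf.Load(path, 0); // 0 = STGM_READ

        // Assumed fix: the load completes in the background, so keep
        // pumping window messages until the parser reports it is done.
        DateTime deadline = DateTime.Now.AddSeconds(10); // arbitrary timeout
        while (dom.readyState != "complete")
        {
            if (DateTime.Now > deadline)
                throw new TimeoutException("Document did not finish loading.");
            Application.DoEvents();
        }

        return dom;
    }

    [STAThread] // MSHTML expects a single-threaded apartment
    static void Main()
    {
        HTMLDocument doc =
            LoadHtml(@"D:\work\htmlparse\ConsoleApplication\hello.html");
        Console.WriteLine(doc.title);
    }
}
```

If the wait loop is not an option, writing the file's text into the document with write(), as in the example above, avoids the file load altogether.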
 
philipl

Thanks for the links and code, guys.
Sunny, thanks for the code; I was able to get what I needed with this as
a start. Cheers.
 
