Want to use msHTML.HTMLDocumentClass

A

Atul

Hi,
I am new to C# and have run into a problem. Here is the scenario: I am
posting some data to a website (an https site), and it returns an HTML response.
I want to parse this HTML and look for some specific tags. To do that,
I wanted to use MSHTML's HTMLDocumentClass, but I
don't know how to use it in C#.
I am using .NET Framework 1.1.
For posting the data, I am using the XMLHTTP class.

Any help is highly appreciated.

Thanks
Atul
 
M

Marcin Smółka

In reply to Atul's post of 2003-11-13 12:53:

Hi,

First you should add a reference to Microsoft.mshtml to your project;
then, through the "mshtml" namespace, you will have access to HTMLDocumentClass.

M.S
 
A

Atul

Hi,
I did what you suggested. Let me clarify the problem.
I have the HTML text stored in a string variable, say "strResponseHTML".
The code I am trying to use is as follows:

private void button1_Click(object sender, System.EventArgs e)
{
    HTMLDocument htmlDoc = new HTMLDocumentClass();
    IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocumentClass();
    try
    {
        string strResponse = getHtml();
        htmlDoc.write(strResponse);
        if (null != htmlDoc)
        {
            MessageBox.Show(htmlDoc.all.length.ToString());
        }
        richTextBox1.Text = htmlDoc.toString();
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message + "\n" + ex.StackTrace);
    }
}

Looking at the code, what I am actually trying to do is parse the response
HTML and look for certain objects and their values, which I would then like
to compare.
We use MSXML2.dll to parse an XML document with XMLDOM; similarly, we "can"
use MSHTML.dll to parse an HTML document with the HTMLDocument object. What I
want to know is how to load strResponseHTML into the HTMLDocument object.
If anybody could help me out, it would be highly appreciated.
Thanks
Atul
 
N

Nicholas Paldino [.NET/C# MVP]

Atul,

In order to do this, you will have to use one of the IPersist interfaces
that HTMLDocument implements. You can use the IPersistFile interface,
saving your string to a file and then loading it that way, or you could use
the IPersistMemory interface, placing your string into unmanaged memory and
then passing the pointer to that string in memory.

However, if you have relative URLs in this document which you need to
have resolved correctly, then this will not work. The reason is that the
class doesn't know about where the document came from, and can not resolve
these accordingly. In this case, you will have to create an implementation
of IMoniker (which is represented in the System.Runtime.InteropServices
namespace with a name of UCOMIMoniker).

When you create this implementation, you have to implement the
BindToStorage method so that it will return an IStream implementation (the
COM interface) which is asked for eventually. This IStream will stream the
string that you had back to MSHTML.

Also, you will want to implement the GetDisplayName method of the
UCOMIMoniker interface so that it returns the url that this string was
downloaded from.

Hope this helps.
 
S

Sunny

Hi Atul,
I'm using this:

object[] oPageText = {sWebPage};
myDoc = new HTMLDocumentClass();
IHTMLDocument2 oMyDoc = (IHTMLDocument2)myDoc;
oMyDoc.write(oPageText);

This works fine. Of course, you have to put this in a try/catch block in
case the page is not well formed.

Hope that helps
Sunny
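
For readers who want a fuller picture, the fragment above can be expanded into a small console sketch. It assumes a project reference to Microsoft.mshtml and an STA thread; the sample markup, the class name WriteSketch, and the "input" filter are hypothetical and only there for illustration. Whether parsing has finished right after write/close may depend on the page, so treat this as a starting point rather than a definitive implementation.

```csharp
using System;
using mshtml; // assumes a project reference to Microsoft.mshtml

class WriteSketch
{
    [STAThread] // MSHTML is an apartment-threaded COM component
    static void Main()
    {
        // Hypothetical response text; in the thread this comes from the web request.
        string sWebPage =
            "<html><body><input name=\"total\" value=\"42\"></body></html>";

        IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocumentClass();

        // write() takes an object[] (marshalled as a COM SAFEARRAY of variants).
        doc.write(new object[] { sWebPage });
        doc.close();

        // Walk every element and report the ones with a given tag name.
        foreach (IHTMLElement el in doc.all)
        {
            if (String.Compare(el.tagName, "input", true) == 0)
            {
                Console.WriteLine("{0} = {1}",
                    el.getAttribute("name", 0),
                    el.getAttribute("value", 0));
            }
        }
    }
}
```

The same loop can compare the attribute values against whatever the caller expects instead of printing them.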
 
H

Hasani

It might be helpful to pre-parse the file to remove scripting, as mentioned in
the following link:
http://216.239.39.104/search?q=cach...&tid=8056+vbcity+mshtml+script&hl=en&ie=UTF-8

I know it has helped me a great deal.
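
One possible way to do such a pre-parse is a regular expression that drops script blocks before the text is handed to MSHTML. This is only a sketch and not necessarily what the linked post does; the class and method names (ScriptStripper, StripScripts) are made up for illustration, and a regex will not catch every form of scripting (inline event handlers such as onclick are left alone).

```csharp
using System;
using System.Text.RegularExpressions;

class ScriptStripper
{
    // Removes <script ...>...</script> blocks, case-insensitively and across line breaks.
    public static string StripScripts(string html)
    {
        return Regex.Replace(
            html,
            @"<script\b[^>]*>.*?</script\s*>",
            "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    static void Main()
    {
        string html =
            "<p>a</p><script type=\"text/javascript\">alert(1);</script><p>b</p>";
        Console.WriteLine(StripScripts(html)); // prints "<p>a</p><p>b</p>"
    }
}
```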
 
S

Sunny

Hi Hasani,
Thanks for the link.
Btw, does IHTMLDocument2.write(object[]) execute the scripts?
I do not display the document, just load it.

It would be very helpful if you could post a link where I can read more
about it.

Thanks
Sunny

 
H

Hasani

http://support.microsoft.com/defaul...port/kb/articles/Q266/3/43.asp&NoWebContent=1

I was trying to do it the way Microsoft describes, but the solution at vbcity
was the quickest and simplest to implement. Microsoft's solution will probably
load and parse the page faster, though.

Also, one of the HTMLElement classes has a property called 'all', which
returns all HTML elements contained inside the element it is invoked on.

I use the IPersistFile 'method' to load the HTML code. I had problems when
using HTMLDocument.write.
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=uermym0CCHA.2716@cpmsftngxa08

and the code I use to load the documents pretty much looks like this:

htmlDoc = new HTMLDocumentClass();
System.Runtime.InteropServices.UCOMIPersistFile pf =
    (System.Runtime.InteropServices.UCOMIPersistFile)htmlDoc;
pf.Load(htmlFilename, 0);
// Pump messages until MSHTML has parsed the file and created the body.
while (htmlDoc.body == null)
    System.Windows.Forms.Application.DoEvents();
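
The loading code starts from a file name, so the response string has to be written to disk first. A minimal sketch of that step follows; the TempHtmlFile class, its Save method, and the choice of UTF-8 are assumptions for illustration, not part of the code above.

```csharp
using System;
using System.IO;
using System.Text;

class TempHtmlFile
{
    // Writes the response text to a uniquely named .htm file and returns its path.
    public static string Save(string html)
    {
        string path = Path.Combine(Path.GetTempPath(),
                                   Guid.NewGuid().ToString("N") + ".htm");
        using (StreamWriter writer = new StreamWriter(path, false, Encoding.UTF8))
        {
            writer.Write(html);
        }
        return path;
    }

    static void Main()
    {
        string htmlFilename = Save("<html><body><p>hello</p></body></html>");
        Console.WriteLine(htmlFilename); // this is the path to hand to pf.Load(htmlFilename, 0)
        File.Delete(htmlFilename);       // clean up once the document has been loaded
    }
}
```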

 
A

Atul

Hi Sunny,
Thanks for the response. Please check the following code and tell me what I
am missing:
private void button1_Click(object sender, System.EventArgs e)
{
    HTMLDocument htmlDoc = new HTMLDocumentClass();
    IHTMLDocument2 doc = (IHTMLDocument2)htmlDoc;

    try
    {
        object[] oResponseHTML = (object[])getHtml();
        MessageBox.Show(oResponseHTML.ToString());
        doc.write(oResponseHTML);
        if (null != doc)
        {
            MessageBox.Show(htmlDoc.all.length.ToString() + "\n" + doc.all.length.ToString());
        }
        richTextBox1.Text = "";

        for (int i = 0; i < doc.all.length; i++)
        {
            richTextBox1.Text = richTextBox1.Text + doc.all.item(null, i);
        }

        //richTextBox1.Text = htmlDoc.toString();
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message + "\n" + ex.StackTrace);
    }
}

private object[] getHtml()
{
    object[] strHTML = null;

    XMLHTTP30Class xmlDoc = new XMLHTTP30Class();
    xmlDoc.open("POST", "http://www.google.com", false, "", "");
    xmlDoc.send(null);
    if (xmlDoc.statusText.ToUpper() == "OK")
        strHTML = (object[])xmlDoc.responseStream;
    xmlDoc = null;
    return strHTML;
}

Can you please correct this code?
Thanks
Atul


 
S

Sunny

Hi Atul,

For reading the web page I'm using:

string sUrl = "http://www.google.com";
System.Net.WebClient source = new System.Net.WebClient();
StreamReader mr = null;

try
{
    mr = new StreamReader(source.OpenRead(sUrl));
    sWebPage = mr.ReadToEnd();
}
catch
{
    oParent.PagesDone++;
    return;
}
finally
{
    if (mr != null)
        mr.Close();
}


Now sWebPage holds the HTML document, and you can wrap it in an
object[]:

object[] oPageText = {sWebPage};

Sunny
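
Since the original question was about posting data rather than a plain read, here is a sketch of the same idea using WebClient.UploadValues. The URL, the form field, and the PostSketch class are hypothetical placeholders, and the UTF-8 decoding is an assumption about the server's response.

```csharp
using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;

class PostSketch
{
    static void Main()
    {
        // Hypothetical endpoint and form field; substitute the real ones.
        string sUrl = "https://example.com/form";
        NameValueCollection form = new NameValueCollection();
        form.Add("field1", "value1");

        WebClient client = new WebClient();

        // UploadValues sends the fields as application/x-www-form-urlencoded
        // and returns the raw response body.
        byte[] raw = client.UploadValues(sUrl, "POST", form);

        // Assumes the server answers in UTF-8; adjust to the real charset.
        string sWebPage = Encoding.UTF8.GetString(raw);
        Console.WriteLine(sWebPage.Length);
    }
}
```

The resulting sWebPage string can then be wrapped in an object[] exactly as shown above.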
 
S

Sunny

Hi Hasani,
thanks for the response. I still do not want to save and re-read a file, so
the solution in that link may help. The problem is (as always) that
my knowledge of C++ is not something I'm proud of :). I have posted a new
thread to ask for help with the translation.

Thanks
Sunny
 
