using mshtml to access HTML page elements

Taylor · Dec 8, 2003

I was pleased to find that I could easily access all the links in a
page using this construct:

IHTMLDocument2 d = (IHTMLDocument2) ie.Document;
IHTMLElementCollection links = d.links;

but disappointed to find I couldn't do the same to get all my tables (using
something like d.tables). Instead I'm resorting to the naive approach of
iterating through d.all, casting each element to a table, and picking out
the objects that don't come back null.

I realize it's horribly inefficient to cast every object to a table and
check for hits. Can you advise?

Here is the naive approach which is very slow:

// Requires references to the SHDocVw and mshtml interop assemblies, plus:
// using System; using System.IO; using System.Threading; using mshtml;
SHDocVw.InternetExplorer ie = new SHDocVw.InternetExplorerClass();
object o = System.Reflection.Missing.Value;
object url = "file://" + Path.Combine(Directory.GetCurrentDirectory(),
    @"..\..\test\test1.html");

ie.Navigate2(ref url, ref o, ref o, ref o, ref o);
while (ie.Busy) { Thread.Sleep(2); }   // wait for the page to finish loading
IHTMLDocument2 d = (IHTMLDocument2) ie.Document;
IHTMLElementCollection all = d.all;
foreach (object el in all)
{
    // Try every element in the page as a table; non-tables come back null.
    HTMLTableClass t = el as HTMLTableClass;
    if (t != null)
    {
        if (3 == t.cells.length)
        {
            foreach (HTMLTableRow row in t.rows)
            {
                Console.WriteLine(row.innerText);
            }
        }
    }
}
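
One possible way to avoid walking d.all is to ask the document for the table elements directly. The sketch below assumes the mshtml interop assembly exposes IHTMLDocument3.getElementsByTagName, and that table elements can be cast to IHTMLTable (for rows) and IHTMLTable2 (for cells). TableLister and PrintThreeCellTables are made-up names for illustration, and I have not timed this against the loop above, so treat it as a starting point rather than a measured fix.

```csharp
using System;
using mshtml;

static class TableLister
{
    // Hypothetical helper: prints the rows of every table that has exactly
    // three cells, mirroring the filter used in the d.all loop.
    public static void PrintThreeCellTables(IHTMLDocument2 d)
    {
        // IHTMLDocument3 adds getElementsByTagName; the cast is a
        // QueryInterface on the same underlying document object.
        IHTMLDocument3 d3 = (IHTMLDocument3) d;
        IHTMLElementCollection tables = d3.getElementsByTagName("table");

        foreach (object el in tables)
        {
            IHTMLTable t = el as IHTMLTable;      // exposes rows
            IHTMLTable2 t2 = el as IHTMLTable2;   // exposes cells
            if (t == null || t2 == null) continue;

            if (t2.cells.length == 3)
            {
                foreach (IHTMLElement row in t.rows)
                {
                    Console.WriteLine(row.innerText);
                }
            }
        }
    }
}
```

Calling TableLister.PrintThreeCellTables(d) with the document obtained from ie.Document would replace the foreach over d.all. Filtering the existing collection with d.all.tags("table") should be a similar option, since only table elements then cross the COM boundary.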

Taylor Monacelli · Dec 9, 2003

OK. I've made some progress: I found out why my naive approach was so
slow. Here is a well-written piece from someone who seems to know what
he's talking about:

<snip>
David J. Marcus [@alhakol.com]

I have some fairly extensive experience traversing the DOM.

I can tell you unabashedly that the performance is absurdly bad.

To traverse a DOM of a medium sized web page on an 800MHz Pentium III using
C# takes up to 10 seconds !!!

I've posted the problem before and got no response from the folks at
Microsoft. Perhaps they are embarrassed by the results. The only response I
got was a vague reference to 'marshalling'.

In doing some more research, the problem turns out to be the marshalling of
data from the MSHTML control to the C# environment. In particular, be aware
that MSHTML creates a fully fleshed node for each HTML tag. This includes
ALL the possible attributes the node can ever have. It then marks each
attribute with a flag (which can be tested) which is 'true' if the attribute
was actually specified in the HTML. This approach is necessary because some
of the attributes have inherited values (meaning that unless the user
explicitly specifies them in the HTML, they contain an inherited value [or a
default value]).

The short of it: there are typically 100 attributes for most HTML tag
types. Multiply this by the number of tags in your HTML page and you get an
idea of the number of marshalling calls required (assuming it is good enough
to marshal an attribute in one call.. if not, it is even worse).

By the way, traversing the same DOM in C++ is virtually instantaneous.

I hope this helps you.

-Regards David
</snip>

This was copied from http://www.dotnet247.com/247reference/msgs/8/41599.aspx
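
To put a rough number on David's estimate, here is a back-of-the-envelope sketch. The figure of 100 attributes per tag comes from his post. The tag count of 1500 is purely an assumed value for a "medium sized" page, and MarshalEstimate is a made-up name. It also assumes one marshalling call per attribute, which he notes is the optimistic case.

```csharp
using System;

public static class MarshalEstimate
{
    // Rough count of cross-boundary calls if every attribute of every
    // element is marshalled once, one call per attribute.
    public static long EstimateCalls(int tagCount, int attributesPerTag)
    {
        return (long) tagCount * attributesPerTag;
    }

    public static void Main()
    {
        // 1500 tags is an illustrative assumption, not a measurement.
        Console.WriteLine(EstimateCalls(1500, 100));   // prints 150000
    }
}
```

Under those assumptions a full traversal costs on the order of 150,000 interop calls, so anything that narrows the set of elements touched should help.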
