using mshtml to access HTML page elements


Taylor

I was pleased to find that I could easily access all the links in a
page using this construct:

IHTMLDocument2 d = (IHTMLDocument2) ie.Document;
IHTMLElementCollection links = d.links;

but disappointed to find I couldn't do the same to get all my tables (using
something like d.tables). Instead I'm resorting to the naive approach of
iterating through d.all, casting each element to a table, and picking out the
objects that didn't come back null.

I realize it's horribly inefficient to cast every object to a table and
check for hits. Can you advise?





Here is the naive approach which is very slow:

SHDocVw.InternetExplorer ie = new SHDocVw.InternetExplorerClass();
object o = System.Reflection.Missing.Value;
object url = "file://" + Path.Combine(Directory.GetCurrentDirectory(),
    @"..\..\test\test1.html");

ie.Navigate2(ref url, ref o, ref o, ref o, ref o);
// Poll until the browser reports that it has finished loading.
while (ie.Busy) { Thread.Sleep(2); }
IHTMLDocument2 d = (IHTMLDocument2) ie.Document;
IHTMLElementCollection all = d.all;
// Every element in the page is visited and test-cast here; this is the slow part.
foreach (object el in all)
{
    HTMLTableClass t = el as HTMLTableClass;
    if (t != null)
    {
        // Only print tables that have exactly three cells.
        if (3 == t.cells.length)
        {
            foreach (HTMLTableRow c in t.rows)
            {
                Console.WriteLine(c.innerText);
            }
        }
    }
}
 

Taylor Monacelli

OK. I've made some progress: I've found out why my naive approach
was so slow. Here is a well-written piece from someone who seems to know
what he's talking about:

<snip>
David J. Marcus [@alhakol.com]

I have some fairly extensive experience traversing the DOM.

I can tell you unabashedly that the performance is absurdly bad.

To traverse a DOM of a medium sized web page on an 800MHz Pentium III using
C# takes up to 10 seconds !!!

I've posted the problem before and got no response from the folks at
Microsoft. Perhaps they are embarrassed by the results. The only response I
got was a vague reference to 'marshalling'.

In doing some more research, the problem turns out to be the marshalling of
data from the MSHTML control to the C# environment. In particular, be aware
that MSHTML creates a fully fleshed node for each HTML tag. This includes
ALL the possible attributes the node can ever have. It then marks each
attribute with a flag (which can be tested) which is 'true' if the attribute
was actually specified in the HTML. This approach is necessary because some
of the attributes have inherited values (meaning that unless the user
explicitly specifies them in the HTML, they contain an inherited value [or a
default value]).

The short of it: there are typically 100 attributes for most HTML tag
types. Multiply this by the number of tags in your HTML page and you get an
idea of the number of marshalling calls required (assuming it is good enough
to marshal an attribute in one call.. if not, it is even worse).

By the way, traversing the same DOM in C++ is virtually instantaneous.

I hope this helps you.

-Regards David
</snip>

This was copied from http://www.dotnet247.com/247reference/msgs/8/41599.aspx
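
If the cost really is per-element marshalling, one thing that may help is to let MSHTML do the filtering, so that only the TABLE elements cross into C# instead of everything in d.all. Below is a minimal sketch of that idea, not a measured fix. It assumes the document object also implements IHTMLDocument3 (IE 5 and later), whose getElementsByTagName returns an IHTMLElementCollection. The class name TableDump and the method PrintThreeCellTables are hypothetical names made up for this example; the rest mirrors the loop from my first post.

```csharp
using System;
using mshtml;

static class TableDump
{
    // Hypothetical helper: prints each row of every table that has exactly three cells.
    public static void PrintThreeCellTables(object document)
    {
        // Assumption: the object behind ie.Document also implements IHTMLDocument3.
        IHTMLDocument3 d3 = (IHTMLDocument3) document;

        // MSHTML selects the elements by tag name, so the managed loop
        // only ever sees TABLE elements rather than the whole of d.all.
        IHTMLElementCollection tables = d3.getElementsByTagName("table");

        foreach (object el in tables)
        {
            // Same cast as in the naive loop; skip anything that is not a table.
            HTMLTableClass t = el as HTMLTableClass;
            if (t == null)
            {
                continue;
            }

            if (3 == t.cells.length)
            {
                foreach (HTMLTableRow row in t.rows)
                {
                    Console.WriteLine(row.innerText);
                }
            }
        }
    }
}
```

With the ie object from the earlier code, the call would be TableDump.PrintThreeCellTables(ie.Document) once ie.Busy is false. I have not timed this against the d.all loop, so treat any speed-up as something to verify on your own pages.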
 
