WebBrowser Control/MSHTML - Performance of walking the HTML DOM

Bryan D.

My C# application currently uses the WebBrowser control and the
MSHTML library to walk the HTML DOM of documents and pull out
information about certain tags that it finds. I've found this to be an
extremely slow process in C#, and I've seen references in this
newsgroup saying that this is a known issue with .NET marshalling.

The post at
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=u$FJlRgpBHA.2180@tkmsftngp07
indicates that the performance is MUCH better in C++. I'm interested
in building some unmanaged C++ that parses the HTML DOM and calls
back into my managed code after parsing.
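
To make the callback idea concrete, here is a minimal sketch of the shape I have in mind. It is not working MSHTML code: `Node`, `TagCallback`, `CollectTag`, and `WalkTree` are hypothetical names, and `Node` is only a stand-in for an MSHTML element so that the sketch compiles without COM. The managed side would pass a delegate as the function pointer; on Windows that would presumably also need an explicit calling convention such as `__stdcall`, which I have left out here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM element; the real code would wrap
// MSHTML::IHTMLElementPtr instead.
struct Node {
    std::string tag;
    std::vector<Node> children;
};

// Assumed callback signature that the managed side would supply
// (for example, a C# delegate marshalled as a function pointer).
typedef void (*TagCallback)(const char* tagName, int depth, void* userData);

// Example callback: appends each tag name to a std::vector<std::string>
// passed through userData.
void CollectTag(const char* tagName, int /*depth*/, void* userData) {
    static_cast<std::vector<std::string>*>(userData)->push_back(tagName);
}

// Walks the whole tree in native code, depth first, and reports every
// element through the callback. Returns the number of elements visited.
int WalkTree(const Node& node, int depth, TagCallback onTag, void* userData) {
    assert(onTag != nullptr);
    onTag(node.tag.c_str(), depth, userData);
    int visited = 1;
    for (const Node& child : node.children)
        visited += WalkTree(child, depth + 1, onTag, userData);
    return visited;
}
```

The intent is that the tree walk itself never crosses the managed/unmanaged boundary; only the per-element callback does.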

However, the code that I have written in C++ is just as slow as the
code in C#. I was wondering, does anyone have code that proves the
point that HTML DOM parsing is "instantaneous" in C++? I would really
appreciate any tips, code snippets, or links to example code.

Thank you very much,
Bryan

PS - Below is the C++ code I'm currently using to try this out. It
takes about 3 seconds (!) on a small-to-medium-sized document.

----------------- snip -----------------------------

#include <ole2.h>
#include <tchar.h>     // _tmain, _TCHAR
#include <iostream>

#import <shdocvw.dll>
#import <mshtml.tlb>

// Recursively visits every element below the given one.
void WalkChildElements( MSHTML::IHTMLElementPtr element )
{
    if( element == NULL )
        return;

    // Hold the collection in smart pointers so it is released on exit
    // (a raw IDispatch* here would leak one reference per call).
    IDispatchPtr pDisp;
    if( FAILED(element->get_children(&pDisp)) || pDisp == NULL )
        return;
    MSHTML::IHTMLElementCollectionPtr children = pDisp;  // implicit QueryInterface
    if( children == NULL )
        return;

    long length = 0;
    children->get_length(&length);
    for( long i = 0; i < length; i++ )
    {
        // item(name, index): a numeric "name" selects the element by position
        MSHTML::IHTMLElementPtr child = children->item( i, i );
        WalkChildElements(child);
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    CoInitialize(0);
    {
        // Start a standalone Internet Explorer instance and load the test page
        SHDocVw::IWebBrowser2Ptr pIE(__uuidof(SHDocVw::InternetExplorer));

        pIE->Visible = VARIANT_TRUE;
        pIE->Navigate("file://c:/tmp.html");
        while( pIE->GetBusy() )
            Sleep(100);

        // GetDocument() returns an IDispatchPtr; the assignment does the QueryInterface
        MSHTML::IHTMLDocument3Ptr pHTMLDoc3 = pIE->GetDocument();

        MSHTML::IHTMLElementPtr element;
        pHTMLDoc3->get_documentElement(&element);
        std::cout << "Begin\n";
        WalkChildElements( element );
        std::cout << "Done\n";
    }
    CoUninitialize();
    return 0;
}
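
The 3 second figure is only eyeballed between the "Begin" and "Done" lines. A rough way to get a real number would be a small helper like the one below. This is just a sketch: `TimeMs` is a hypothetical name, and it assumes a compiler that provides `std::chrono` and lambdas.

```cpp
#include <chrono>
#include <functional>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
double TimeMs(const std::function<void()>& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

In the program above, the call would look something like `double ms = TimeMs([&] { WalkChildElements(element); });`, which keeps the navigation and the `Sleep(100)` polling out of the measurement.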
 
