Multi-Threaded App

Robert Sheppard · Feb 14, 2008

I am new to C# and am trying to build a multi-threaded web crawler. I want
to crawl many sites all at once. I know how to use IHTMLDocument2 to parse
the document object, but I want to launch multiple threads to parse each
individual web page.

With the WebBrowser control I can start parsing when I get the
Document_Complete event, but how can I do this with each web site on a
different thread? How are the Document_Complete events
handled in a multi-threaded environment?

This is an asynchronous operation, so I cannot see how it can be
done.

Nicholas Paldino [.NET/C# MVP] · Feb 14, 2008

Robert,

This would be difficult in this situation. You couldn't use the
WebBrowser control, because it needs to be tied to a UI thread.

You could use MSHTML through COM interop. However, you would have to
make sure that every thread that you use MSHTML on is set up so that the
ApartmentState for that thread is STA. I am not sure about this, but I also
believe you would have to pump messages in order for the events to work
correctly.
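
A minimal sketch of the STA setup described above, assuming a plain
System.Threading.Thread per worker. The StaWorker class and its Start
method are hypothetical names used only for illustration; the MSHTML
calls themselves are left out and would go inside the delegate.

```csharp
using System;
using System.Threading;

static class StaWorker
{
    // Starts a dedicated thread in a single-threaded apartment (STA).
    // COM objects such as MSHTML would be created and used inside 'work'.
    public static Thread Start(ThreadStart work)
    {
        Thread thread = new Thread(work);
        // The apartment state has to be set before the thread is started.
        thread.SetApartmentState(ApartmentState.STA);
        thread.IsBackground = true;
        thread.Start();
        return thread;
    }
}
```

If events do turn out to need a message pump, one option is to call
System.Windows.Forms.Application.Run() on that thread, which runs a
standard message loop until Application.ExitThread() is called.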

Needless to say, it's a better idea in this case to use
HttpWebRequest/HttpWebResponse and then take the content from those and set
the content of a new MSHTML instance in your thread to the content
downloaded. This way, you don't have to wait for MSHTML to download the
document, and you can work with it right away.
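
As a rough sketch of the download step, assuming one thread per URL and
a caller-supplied callback: PageFetcher, Download and CrawlAll are
hypothetical names, not part of any library. The callback receives the
downloaded markup as a string, which is the content that would then be
handed to MSHTML or any other parser.

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading;

static class PageFetcher
{
    // Downloads one page synchronously and returns its body as text.
    public static string Download(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Fetches every URL on its own thread and passes (url, html) to onPage.
    // onPage runs on the worker threads, so it must be thread-safe.
    public static void CrawlAll(string[] urls,
                                Func<string, string> fetch,
                                Action<string, string> onPage)
    {
        Thread[] workers = new Thread[urls.Length];
        for (int i = 0; i < urls.Length; i++)
        {
            string url = urls[i]; // local copy so each thread captures its own URL
            workers[i] = new Thread(() =>
            {
                try
                {
                    onPage(url, fetch(url));
                }
                catch (Exception ex)
                {
                    // One failed page should not take down the whole crawl.
                    Console.Error.WriteLine(url + ": " + ex.Message);
                }
            });
            workers[i].Start();
        }
        foreach (Thread worker in workers)
        {
            worker.Join(); // wait until every page has been processed
        }
    }
}
```

It could be called as PageFetcher.CrawlAll(urls, PageFetcher.Download,
(url, html) => { ... }). For a large crawl, a fixed number of worker
threads pulling from a shared queue would likely scale better than one
thread per URL.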

Robert Sheppard · Feb 14, 2008

Thanks... I will look at HttpWebRequest/HttpWebResponse. The old VB6 crawler
that I am porting from was using the WebBrowser control, which works fine
but is very slow. Let me stress SLOW.
Thanks again for the help.

Nicholas Paldino [.NET/C# MVP] · Feb 15, 2008

Robert,

Do you have a specific need to parse the entire document, or are you
looking for specific parts? If you don't need to parse the entire document,
and what you are looking to scrape from the HTML is specific, then using
HttpWebRequest and HttpWebResponse will probably simplify things
considerably.
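
To illustrate the "specific parts" case, here is a small sketch that
pulls only the link targets out of markup that has already been
downloaded as a string, with no document object involved. LinkScraper
and ExtractLinks are hypothetical names, and the regular expression is
a simplification that assumes quoted href attributes; it is not a full
HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LinkScraper
{
    // Matches <a ... href="..."> or <a ... href='...'> and captures the target.
    private static readonly Regex Href = new Regex(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the href values in the order they appear in the markup.
    public static List<string> ExtractLinks(string html)
    {
        List<string> links = new List<string>();
        foreach (Match match in Href.Matches(html))
        {
            links.Add(match.Groups[1].Value);
        }
        return links;
    }
}
```

Because this works on plain strings, it has no COM or apartment
requirements and can run on any worker thread.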
