Multi-Threaded App

Robert Sheppard

I am new to C# and am trying to build a multi-threaded web crawler. I want
to crawl many sites at once. I know how to use IHTMLDocument2 to parse
the document object, but I want to launch multiple threads so that each
individual web page is parsed on its own thread.

With the WebBrowser control I can start parsing when I get the
Document_Complete event, but how can I do this with each web site on a
different thread? How are the Document_Complete events
handled in a multi-threaded environment?

This is an asynchronous operation, so I cannot see how it can be
done.
 
Nicholas Paldino [.NET/C# MVP]

Robert,

This would be difficult in this situation. You couldn't use the
WebBrowser control, because it needs to be tied to a UI thread.

You could use MSHTML through COM interop. However, you would have to
make sure that every thread that you use MSHTML on is set up so that the
ApartmentState for that thread is STA. I am not sure about this, but I also
believe you would have to pump messages in order for the events to work
correctly.
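
As a rough illustration of the STA requirement, here is a minimal sketch of starting a worker thread in a single-threaded apartment. The StaWorker class and StartSta method are names made up for this example, and the sketch assumes Windows, since SetApartmentState is not supported elsewhere. It does not pump messages, so it does not settle whether MSHTML events would fire on such a thread.

```csharp
using System;
using System.Threading;

// Hypothetical helper for illustration; not part of any library.
static class StaWorker
{
    // Creates a thread, marks it STA, and starts it.
    // The apartment state must be set before the thread is started.
    public static Thread StartSta(ThreadStart work)
    {
        var thread = new Thread(work);
        thread.SetApartmentState(ApartmentState.STA); // Windows only
        thread.Start();
        return thread;
    }
}
```

Any COM object such as an MSHTML document would then be created and used inside the delegate passed to StartSta, so that it lives on that thread.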

Needless to say, it's a better idea in this case to use
HttpWebRequest/HttpWebResponse and then take the content from those and set
the content of a new MSHTML instance in your thread to the content
downloaded. This way, you don't have to wait for MSHTML to download the
document, and you can work with it right away.
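
The download half of that suggestion might look something like the sketch below: one thread per URL, each fetching a page with HttpWebRequest/HttpWebResponse. The Crawler class and its method names are invented for the example, and the one-thread-per-URL layout is only an assumption about how the crawler could be organised. Handing the downloaded string to an MSHTML document is left as a comment because it depends on the interop assembly being referenced.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Threading;

// Hypothetical crawler skeleton for illustration.
static class Crawler
{
    // Downloads one page synchronously and returns its HTML.
    public static string Download(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Starts one thread per URL, waits for all of them, and returns
    // the HTML keyed by URL (null where the download failed).
    public static Dictionary<string, string> DownloadAll(IEnumerable<string> urls)
    {
        var results = new Dictionary<string, string>();
        var threads = new List<Thread>();
        foreach (string url in urls)
        {
            string current = url; // private copy for the closure
            var thread = new Thread(() =>
            {
                string html = null;
                try
                {
                    html = Download(current);
                    // Assumption: this is where the string would be written
                    // into a new MSHTML document and parsed on this thread.
                }
                catch (Exception ex)
                {
                    Console.Error.WriteLine(current + ": " + ex.Message);
                }
                lock (results) { results[current] = html; }
            });
            threads.Add(thread);
            thread.Start();
        }
        foreach (Thread thread in threads) thread.Join();
        return results;
    }
}
```

For a large crawl, a fixed number of worker threads pulling URLs from a shared queue would probably be kinder to the machine than one thread per URL.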
 
Robert Sheppard

Thanks... I will look at HttpWebRequest/HttpWebResponse. The old VB6 crawler
that I am porting from was using the WebBrowser control, which works fine
but is very slow. Let me stress SLOW.
Thanks again for the help.

Nicholas Paldino [.NET/C# MVP]

Robert,

Do you have a specific need to parse the entire document, or are you
looking for specific parts? If you don't need to parse the entire document,
and what you are looking to scrape from the HTML is specific, then using
HttpWebRequest and HttpWebResponse will probably simplify things
considerably.
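
If the crawler only needs links, for example, a small sketch of scraping them straight from the downloaded string could look like this. The LinkScraper class and the regular expression are assumptions made for illustration, not a full HTML parser; the pattern only catches quoted href values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration.
static class LinkScraper
{
    // Matches href="..." or href='...' and captures the value.
    static readonly Regex Href = new Regex(
        "href\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every quoted href value found in the HTML, in document order.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match match in Href.Matches(html))
        {
            links.Add(match.Groups[1].Value);
        }
        return links;
    }
}
```

Because this works on a plain string, it can run on any thread without COM apartments or message pumping.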


--
- Nicholas Paldino [.NET/C# MVP]
