Web Crawler Architecture

Clayton

Hi all,

Let me put you in the picture... I was assigned to design and develop
a focused/topical web crawler. The expected outcome of the project
will be a search component that will automatically crawl the WWW.
Initially, the user (which is the service provider) will feed the
crawler with relevant links and documents in order to specify the
required information. The crawler will then start extracting links and
other relevant information from the websites visited. By using latent
semantic analysis and other techniques, the crawler will accept or
discard links according to the content's relevance to the subject.
This will ensure that the corpus of documents is highly relevant to
the subject requested by the user.
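As a rough illustration of the accept/discard step, here is a minimal C# sketch using plain cosine similarity over term-frequency vectors (a full LSA pipeline would additionally project these vectors through an SVD of the term-document matrix). All names, the sample texts and the 0.3 threshold are illustrative assumptions, not part of the actual design:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class RelevanceFilter
{
    // Build a term-frequency vector from whitespace-tokenized text.
    static Dictionary<string, double> TermFrequencies(string text)
    {
        var tf = new Dictionary<string, double>();
        var terms = text.ToLowerInvariant()
                        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var term in terms)
            tf[term] = tf.TryGetValue(term, out var n) ? n + 1 : 1;
        return tf;
    }

    // Cosine similarity between two sparse term-frequency vectors.
    static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = a.Keys.Intersect(b.Keys).Sum(k => a[k] * b[k]);
        double na = Math.Sqrt(a.Values.Sum(v => v * v));
        double nb = Math.Sqrt(b.Values.Sum(v => v * v));
        return (na == 0 || nb == 0) ? 0 : dot / (na * nb);
    }

    static void Main()
    {
        var topic = TermFrequencies("motor sports racing cars drivers circuits");
        var page  = TermFrequencies("racing drivers compete on circuits in fast cars");
        // Accept the page only if it is similar enough to the seed topic.
        bool accept = Cosine(topic, page) >= 0.3;
        Console.WriteLine(accept);
    }
}
```

In a real crawler the topic vector would be built from the user-supplied seed documents, and stop-word removal plus TF-IDF weighting would precede the similarity test.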

Multiple threads will be used to crawl, analyze, save and index
website information (this will improve the component's performance).
The component can be accessed by authorized third parties (the
clients) via a web service. The web service will take a query as input
and return a set of results.
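One way to sketch that multi-threaded crawl/analyze/index hand-off is a producer/consumer arrangement over `BlockingCollection<T>` (available from .NET 4.0 onward); the simulated downloads and the single consumer here are placeholder assumptions:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class CrawlPipeline
{
    static void Main()
    {
        // Fetched pages waiting to be analyzed/indexed.
        // A bounded capacity applies back-pressure to the crawler threads.
        var pages = new BlockingCollection<string>(boundedCapacity: 100);

        // Producer: a crawler thread that downloads pages (simulated here).
        var crawler = Task.Run(() =>
        {
            for (int i = 0; i < 5; i++)
                pages.Add("page-" + i + " content");
            pages.CompleteAdding();   // tell the consumers crawling is done
        });

        // Consumer: an analyzer/indexer thread draining the collection.
        int indexed = 0;
        var indexer = Task.Run(() =>
        {
            foreach (var page in pages.GetConsumingEnumerable())
                Interlocked.Increment(ref indexed);
        });

        Task.WaitAll(crawler, indexer);
        Console.WriteLine(indexed);
    }
}
```

Several crawler tasks and several indexer tasks can share the same collection; `GetConsumingEnumerable` handles the blocking and shutdown semantics for you.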

So to recap, if for example the user (service provider) feeds the
component with documents and links related to motor sports, the
crawler should automatically build a corpus of documents which are
highly relevant to motor sports. Then third parties can search this
corpus of documents through the use of a web service.

Although speed is not a major issue at the moment, I would like to
plan this from the ground up. The questions I have at the moment
are:

1. Is C# the ideal programming language for this task?
2. What is the best and fastest way to store, index and retrieve the
information? Should I save everything in a database, or in flat files?
3. Should I keep the unvisited URL queue in main memory using a
Queue<T>, or persist it in a database? Note that the URL queue will be
continuously accessed by the crawler threads to enqueue and dequeue
URLs.
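On question 3, one caveat worth noting: `Queue<T>` itself is not thread-safe, so with multiple crawler threads enqueuing and dequeuing concurrently you would need either a lock around it or `ConcurrentQueue<T>` (.NET 4.0+). A minimal in-memory sketch follows; the duplicate-URL check via `ConcurrentDictionary` is an assumption added for illustration, and database persistence could be layered on top by periodically snapshotting the queue:

```csharp
using System;
using System.Collections.Concurrent;

// A minimal thread-safe in-memory URL frontier.
class UrlFrontier
{
    private readonly ConcurrentQueue<string> _pending = new ConcurrentQueue<string>();
    private readonly ConcurrentDictionary<string, bool> _seen =
        new ConcurrentDictionary<string, bool>();

    // Enqueue a URL unless it has already been seen.
    public bool TryAdd(string url)
    {
        if (!_seen.TryAdd(url, true)) return false;   // duplicate: skip
        _pending.Enqueue(url);
        return true;
    }

    // Dequeue the next URL to crawl, if any.
    public bool TryTake(out string url) => _pending.TryDequeue(out url);

    public static void Main()
    {
        var frontier = new UrlFrontier();
        frontier.TryAdd("http://example.com/");
        frontier.TryAdd("http://example.com/");      // ignored: already seen
        frontier.TryAdd("http://example.com/news");

        int taken = 0;
        while (frontier.TryTake(out _)) taken++;
        Console.WriteLine(taken);
    }
}
```

A plain FIFO ignores politeness (per-host delays) and prioritization by relevance score; a production frontier usually needs both.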

Please forward any valuable material you might have on this subject.
Any suggestions, comments and questions are greatly appreciated.
Thanks in advance for your help and time.

Regards,
Clayton
 
Ignacio Machin ( .NET/ C# MVP )

Hi,

A Google search will help you; the scope of the project (WWW
crawling) is beyond the scope of this newsgroup.
 
