Web Crawler Architecture

Clayton

Hi all,

Let me put you in the picture... I was assigned to design and develop
a focused/topical web crawler. The expected outcome of the project
will be a search component that will automatically crawl the WWW.
Initially, the user (which is the service provider) will feed the
crawler with relevant links and documents in order to specify the
required information. The crawler will then start extracting links and
other relevant information from the websites visited. By using latent
semantic analysis and other techniques, the crawler will accept or
discard links according to the content's relevance to the subject.
This will ensure that the corpus of documents is highly relevant to
the subject requested by the user.
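As a rough illustration of the accept/discard step, here is a minimal C# sketch using plain cosine similarity over term-frequency vectors (a full LSA pipeline would additionally project these vectors through an SVD of the term-document matrix). All names, the sample texts and the 0.3 threshold are illustrative assumptions, not part of the actual design:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class RelevanceFilter
{
    // Build a term-frequency vector from whitespace-tokenized text.
    static Dictionary<string, double> TermFrequencies(string text)
    {
        var tf = new Dictionary<string, double>();
        var terms = text.ToLowerInvariant()
                        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var term in terms)
            tf[term] = tf.TryGetValue(term, out var n) ? n + 1 : 1;
        return tf;
    }

    // Cosine similarity between two sparse term-frequency vectors.
    static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = a.Keys.Intersect(b.Keys).Sum(k => a[k] * b[k]);
        double na = Math.Sqrt(a.Values.Sum(v => v * v));
        double nb = Math.Sqrt(b.Values.Sum(v => v * v));
        return (na == 0 || nb == 0) ? 0 : dot / (na * nb);
    }

    static void Main()
    {
        var topic = TermFrequencies("motor sports racing cars drivers circuits");
        var page  = TermFrequencies("racing drivers compete on circuits in fast cars");
        // Accept the page only if it is similar enough to the seed topic.
        bool accept = Cosine(topic, page) >= 0.3;
        Console.WriteLine(accept);
    }
}
```

In a real crawler the topic vector would be built from the user-supplied seed documents, and stop-word removal plus TF-IDF weighting would precede the similarity test.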

Multiple threads will be used to crawl, analyze, save and index
website information (this will improve the component's performance).
The component can be accessed by authorized third parties (the
clients) via a web service. The web service will take a query as input
and return a set of results.
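One way to sketch that multi-threaded crawl/analyze/index hand-off is a producer/consumer arrangement over `BlockingCollection<T>` (available from .NET 4.0 onward); the simulated downloads and the single consumer here are placeholder assumptions:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class CrawlPipeline
{
    static void Main()
    {
        // Fetched pages waiting to be analyzed/indexed.
        // A bounded capacity applies back-pressure to the crawler threads.
        var pages = new BlockingCollection<string>(boundedCapacity: 100);

        // Producer: a crawler thread that downloads pages (simulated here).
        var crawler = Task.Run(() =>
        {
            for (int i = 0; i < 5; i++)
                pages.Add("page-" + i + " content");
            pages.CompleteAdding();   // tell the consumers crawling is done
        });

        // Consumer: an analyzer/indexer thread draining the collection.
        int indexed = 0;
        var indexer = Task.Run(() =>
        {
            foreach (var page in pages.GetConsumingEnumerable())
                Interlocked.Increment(ref indexed);
        });

        Task.WaitAll(crawler, indexer);
        Console.WriteLine(indexed);
    }
}
```

Several crawler tasks and several indexer tasks can share the same collection; `GetConsumingEnumerable` handles the blocking and shutdown semantics for you.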

So to recap, if for example the user (service provider) feeds the
component with documents and links related to motor sports, the
crawler should automatically build a corpus of documents which are
highly relevant to motor sports. Then third parties can search this
corpus of documents through the use of a web service.

Although speed is not a major issue at the moment, I would like to
plan this from the ground up. The questions I have at the moment
are:

1. Is C# the ideal programming language for this task?
2. What is the best and fastest way to store, index and retrieve the
information? Should I save everything in a database, or in flat files?
3. Should I keep the unvisited URL queue in main memory using a
Queue<T>, or persist it in a database? Note that the URL queue will be
continuously accessed by the crawler threads to enqueue and dequeue
URLs.
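On question 3, one caveat worth noting: `Queue<T>` itself is not thread-safe, so with multiple crawler threads enqueuing and dequeuing concurrently you would need either a lock around it or `ConcurrentQueue<T>` (.NET 4.0+). A minimal in-memory sketch follows; the duplicate-URL check via `ConcurrentDictionary` is an assumption added for illustration, and database persistence could be layered on top by periodically snapshotting the queue:

```csharp
using System;
using System.Collections.Concurrent;

// A minimal thread-safe in-memory URL frontier.
class UrlFrontier
{
    private readonly ConcurrentQueue<string> _pending = new ConcurrentQueue<string>();
    private readonly ConcurrentDictionary<string, bool> _seen =
        new ConcurrentDictionary<string, bool>();

    // Enqueue a URL unless it has already been seen.
    public bool TryAdd(string url)
    {
        if (!_seen.TryAdd(url, true)) return false;   // duplicate: skip
        _pending.Enqueue(url);
        return true;
    }

    // Dequeue the next URL to crawl, if any.
    public bool TryTake(out string url) => _pending.TryDequeue(out url);

    public static void Main()
    {
        var frontier = new UrlFrontier();
        frontier.TryAdd("http://example.com/");
        frontier.TryAdd("http://example.com/");      // ignored: already seen
        frontier.TryAdd("http://example.com/news");

        int taken = 0;
        while (frontier.TryTake(out _)) taken++;
        Console.WriteLine(taken);
    }
}
```

A plain FIFO ignores politeness (per-host delays) and prioritization by relevance score; a production frontier usually needs both.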

Please forward any valuable material you might have on this subject.
Any suggestions, comments and questions are greatly appreciated.
Thanks in advance for your help and time.

Regards,
Clayton
 
Ignacio Machin ( .NET/ C# MVP )

Hi,

A Google search will help you; the scope of the project (WWW
crawling) is beyond the scope of this newsgroup.
 
