Ferreting out broken links, Part 2

  • Thread starter: Dave

Dave

Hello All,

A couple of weeks ago, I undertook to write a utility that would loop
through various URLs and test whether they were valid. I got some good help
from this list, and was able to write the utility.

Now I have run into a problem that is difficult for me to solve. It is
this: when looping through a large set of URLs, if many of the URLs are
bad, the program will time out. Conversely, if most of the URLs are good,
it will perform as expected and complete.

I am including code stubs below that will illustrate this.


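Dim req As System.Net.HttpWebRequest
Dim resp As System.Net.HttpWebResponse
Dim s As String
Dim LinkStatus As String

For i As Integer = 0 To 1000

    ' WebRequest.Create needs an absolute URI, including the scheme
    s = "http://www.google.com"

    req = DirectCast(System.Net.WebRequest.Create(s), System.Net.HttpWebRequest)

    Try
        resp = DirectCast(req.GetResponse(), System.Net.HttpWebResponse)
        LinkStatus = resp.StatusCode.ToString()

        resp.Close()

    Catch exWeb As System.Net.WebException

        LinkStatus = exWeb.Message

    End Try

Next i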

'The preceding block will work because finding www.google.com 1000 times is
'not time-consuming.



'But the next block tries to access a non-existent site. Even doing this
'"only" 500 times causes the app to time out, evidently because it takes
'longer to GetResponse() a non-existent site.


For i As Integer = 0 To 500

    'This time we try a non-existent site
    s = "http://www.google.edu"

    req = DirectCast(System.Net.WebRequest.Create(s), System.Net.HttpWebRequest)
    Try
        resp = DirectCast(req.GetResponse(), System.Net.HttpWebResponse)
        LinkStatus = resp.StatusCode.ToString()

        resp.Close()

    Catch exWeb As System.Net.WebException

        LinkStatus = exWeb.Message

    End Try

Next i




So can anybody provide any pointers or documentation that would help me
solve this problem? I need the program to be able to handle large sets of
invalid URLs.

Thanks very much in advance,

Dave
 
Dave said:
Now I have run into a problem that is difficult for me to solve. It is
this: when looping through a large set of URLs, if many of the URLs are
bad, the program will time out. Conversely, if most of the URLs are good,
it will perform as expected and complete.

The program will time out? What does that mean?

Regardless, the way I would approach this problem is by using multiple
threads. Create a queue of urls to be tested and an empty output queue
of urls and their status (success, non-existent, timeout, whatever). Create
the proper mechanisms to dequeue a url to be tested, and to enqueue a url
and its status once it has been tested, in a thread-safe way. Launch a
number of threads, where each thread does the following:

while true
    if there are no urls left to test then return, thereby ending the thread
    dequeue a url to test
    test the url
    update the output queue with the result
end while

Your main program launches a number of these threads (10 is as good a
starting number as any) and waits until they are all completed. Each thread
will operate independently of the others. If one runs slowly because it ran
into a batch of invalid urls, others will run more quickly. The process will
bog down only when all threads are running slowly, in which case you should
launch more threads (i.e., if 10 is still too slow, try 20). So long as at
least one thread is having good success testing urls, the entire process will
keep moving along.
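
The scheme described above can be sketched roughly as follows. This is only a minimal console-style illustration, not a tested production utility: the module name LinkChecker, the helper TestUrl, and the queue names pending and results are made up for the example, and it assumes the .NET Framework classes already used in this thread (System.Net.HttpWebRequest and System.Collections.Queue).

```vbnet
Imports System
Imports System.Collections
Imports System.Net
Imports System.Threading

Module LinkChecker

    ' Shared queues (hypothetical names); every access is wrapped in SyncLock.
    Private ReadOnly pending As New Queue()
    Private ReadOnly results As New Queue()

    ' Worker loop: take a URL, test it, record the result, repeat until empty.
    Private Sub Worker()
        While True
            Dim url As String = Nothing
            SyncLock pending
                If pending.Count = 0 Then Return   ' nothing left: end this thread
                url = CStr(pending.Dequeue())
            End SyncLock

            Dim status As String = TestUrl(url)

            SyncLock results
                results.Enqueue(url & " -> " & status)
            End SyncLock
        End While
    End Sub

    ' Returns the HTTP status name, or the exception message on failure.
    Private Function TestUrl(ByVal url As String) As String
        Try
            Dim req As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
            Dim resp As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
            Try
                Return resp.StatusCode.ToString()
            Finally
                resp.Close()
            End Try
        Catch ex As WebException
            Return ex.Message
        Catch ex As UriFormatException
            Return ex.Message
        End Try
    End Function

    Sub Main()
        pending.Enqueue("http://www.google.com")
        pending.Enqueue("http://www.google.edu")

        Const WorkerCount As Integer = 10   ' the starting number suggested above
        Dim workers(WorkerCount - 1) As Thread
        For i As Integer = 0 To WorkerCount - 1
            workers(i) = New Thread(AddressOf Worker)
            workers(i).Start()
        Next

        ' Wait for every worker to finish before reading the results.
        For Each t As Thread In workers
            t.Join()
        Next

        For Each entry As Object In results
            Console.WriteLine(entry)
        Next
    End Sub

End Module
```

The emptiness check and the dequeue sit inside the same SyncLock so that two threads cannot both see one remaining url and then both try to remove it.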
 
By timeout, I mean that I get this message:

Description: An unhandled exception occurred during the execution of the
current web request. Please review the stack trace for more information
about the error and where it originated in the code.

Exception Details: System.Web.HttpException: Request timed out.

[HttpException (0x80004005): Request timed out.]




It's an interesting solution you lay out - if I don't see anything simpler,
I will give it a try. Statements like this, from Help, which I don't fully
understand, take some of my optimism away, though:
Thread Safety
An application must run in full trust mode when using serialization.



Anyway, thanks for your quick response.

Dave




AMercer said:
The program will time out? What does that mean?

Regardless, the way I would approach this problem is by using multiple
threads. Create a queue of urls to be tested and an empty output queue
[snip]
 
1. I don't know if it will make a difference or not, but I suggest you put
req = System.Net.WebRequest.Create( s )
inside the try block.

2. Re:
An application must run in full trust mode when using serialization.
Don't worry about this - serialization as used by MS means preserving an
object to a medium like a disk, and later, deserializing it refers to
creating a clone of the object from disk. It is a poor choice of words.
Before MS appropriated the word, to serialize used to mean to place in
sequence. In some computer science settings, it means enqueue.

3. Re threading... Assuming you have queues (System.Collections.Queue) for
use by the threads, the problem you need to solve is the contention problem.
You have to prevent two threads from updating a queue at the same time. Two
among many choices are:
    Threading.Monitor.Enter(MyQueue)
    ... enqueue or dequeue here
    Threading.Monitor.Exit(MyQueue)
and
    SyncLock MyQueue
    ... enqueue or dequeue here
    End SyncLock
Additionally, the .net queue object has property IsSynchronized which sounds
like it solves the contention problem, but I have no experience with it.
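
On the IsSynchronized point, a small sketch may help. As far as I know, IsSynchronized is a read-only flag, and the thread-safe wrapper itself comes from the shared method Queue.Synchronized. The module name QueueLockingDemo is made up for illustration. Note that the wrapper only protects individual calls, so a "check Count, then Dequeue" pair still wants a lock around it.

```vbnet
Imports System
Imports System.Collections

Module QueueLockingDemo

    Sub Main()
        Dim plain As New Queue()
        ' Queue.Synchronized returns a wrapper whose individual calls are locked.
        Dim wrapped As Queue = Queue.Synchronized(plain)

        Console.WriteLine(plain.IsSynchronized)    ' False
        Console.WriteLine(wrapped.IsSynchronized)  ' True

        wrapped.Enqueue("http://www.google.com")

        ' The Count test and the Dequeue are two separate calls, so another
        ' thread could empty the queue between them. Locking on SyncRoot
        ' makes the pair atomic.
        Dim url As String = Nothing
        SyncLock wrapped.SyncRoot
            If wrapped.Count > 0 Then
                url = CStr(wrapped.Dequeue())
            End If
        End SyncLock

        Console.WriteLine(url)
    End Sub

End Module
```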

Good luck.
 
Okay, thanks again.

1) Actually, I have the req = ...Create(s) stmt in its own Try block since
some of the URLs are so poorly formed they create a URI error on that
statement. I left all that out for clarity.

2) Thanks for that good info. That is encouraging.

3) I will probably try this, but I'm going to have to come back to it
since it looks like I need to teach myself multi-threading.

Anyway, thanks a lot for all this.

Dave
 
This solution was almost too simple.

Just added the following line in the Page_Load() sub, and it was done.

server.ScriptTimeout = 300 ' i.e. 5 minutes

Dave
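
A complementary idea, separate from raising the page's script timeout, is to cap how long each individual request may take, so that dead links fail faster. The following is only a sketch under assumptions: CheckLink and TimeoutDemo are hypothetical names, the 5000 ms value is arbitrary, and switching to HEAD is an optional extra that some servers may not handle well. HttpWebRequest.Timeout is in milliseconds and defaults to 100,000.

```vbnet
Imports System
Imports System.Net

Module TimeoutDemo

    ' Hypothetical helper: checks one URL, giving up after timeoutMs milliseconds.
    Function CheckLink(ByVal url As String, ByVal timeoutMs As Integer) As String
        Try
            Dim req As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
            req.Timeout = timeoutMs   ' the default is 100,000 ms
            req.Method = "HEAD"       ' ask for headers only; some servers reject HEAD
            Dim resp As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
            Try
                Return resp.StatusCode.ToString()
            Finally
                resp.Close()
            End Try
        Catch ex As WebException
            Return ex.Message
        Catch ex As UriFormatException
            Return ex.Message
        End Try
    End Function

    Sub Main()
        Console.WriteLine(CheckLink("http://www.google.com", 5000))
        Console.WriteLine(CheckLink("http://www.google.edu", 5000))
    End Sub

End Module
```

With a shorter per-request limit, the total running time of the loop depends less on how many of the urls are bad, though name resolution for a non-existent host may still take a while on its own.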


 
