Downloading WebSites using HttpWebRequest

Thomas Peter

I am building a precache engine that requests over 100 pages from a remote
server in order to cache them remotely.
Can I use the HttpWebRequest and WebResponse classes for this, or must I use
the MSHTML objects to really load the HTML and request all of the images on
the site?

// Requires: using System.IO; using System.Net; using System.Text;

string lcUrl = "http://www.cnn.com";

// *** Establish the request
HttpWebRequest loHttp = (HttpWebRequest) WebRequest.Create(lcUrl);

// *** Set properties
loHttp.Timeout = 10000; // 10 secs
loHttp.UserAgent = "Code Sample Web Client";

// *** Send the request and retrieve the response
HttpWebResponse loWebResponse = (HttpWebResponse) loHttp.GetResponse();

Encoding enc = Encoding.GetEncoding(1252); // Windows default Code Page

StreamReader loResponseStream =
    new StreamReader(loWebResponse.GetResponseStream(), enc);

string lcHtml = loResponseStream.ReadToEnd();

// *** Close the reader first, then the response
loResponseStream.Close();
loWebResponse.Close();
 
Steven Cheng[MSFT]

Hi Thomas,

As for requesting and caching remote pages, I think HttpWebRequest is
capable of handling this. We can use HttpWebRequest to send a request to a
given URL and get its response stream, and then store the response (HTML or
any other MIME type) in whatever persistence medium we want, for example
the file system, memory, or a database.

The MSHTML components are a library that helps to programmatically process
a web page's response as a Document (DOM structure), just as a web browser
does. If we just want to get the response (the HTML output or file stream),
HttpWebRequest is enough and MSHTML is not necessary.
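
As a rough illustration of storing a response in the file system, here is a minimal sketch. The class and method names (PageCache, SaveStream, CachePage) are made up for this example and are not part of the framework, and error handling is left out. Note that this stores only the one resource that was requested; images referenced by the page are separate requests.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, not a framework class.
public static class PageCache
{
    // Copies everything from 'source' into a file at 'path'
    // and returns the number of bytes written.
    public static long SaveStream(Stream source, string path)
    {
        long total = 0;
        byte[] buffer = new byte[4096];
        using (FileStream file = new FileStream(path, FileMode.Create, FileAccess.Write))
        {
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                file.Write(buffer, 0, read);
                total += read;
            }
        }
        return total;
    }

    // Requests 'url' and stores the raw response body
    // (HTML, image, or any other MIME type) at 'path'.
    public static long CachePage(string url, string path)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 10000; // 10 seconds
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream body = response.GetResponseStream())
        {
            return SaveStream(body, path);
        }
    }
}
```

A call such as PageCache.CachePage("http://www.cnn.com/", @"c:\cache\cnn.html") would then write the page to disk, assuming the target folder exists.
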
In addition, here are some tech articles on using the HttpWebRequest to
request web resources:

#Accessing Web Sites Using Desktop Applications
http://www.devsource.ziffdavis.com/print_article/0,2043,a=119849,00.asp

#Crawl Web Sites and Catalog Info to Any Data Store with ADO.NET and Visual
Basic .NET
http://msdn.microsoft.com/msdnmag/issues/02/10/spiderinnet/

Hope this also helps. Thanks.

Regards,

Steven Cheng
Microsoft Online Support

Get Secure! www.microsoft.com/security
(This posting is provided "AS IS", with no warranties, and confers no
rights.)

Get Preview at ASP.NET whidbey
http://msdn.microsoft.com/asp.net/whidbey/default.aspx
 
Thomas Peter

Thanks Steven,

I need to make sure that I am remotely caching all of the HTML, including all
pictures, hence I figured a simple WebRequest won't do.
So I am trying to get the GetResponseStream() result into an HTMLDocument
object to ensure that the entire site loads.
But:

StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF8);

string tmp = readStream.ReadLine();

HTMLDocument htmlDoc = new HTMLDocumentClass();

// ??? how do I get the response stream into/as an HTMLDocument?
htmlDoc = (HTMLDocument) tmp;

Any ideas?

///--------------- Full example

HttpWebRequest request =
    (HttpWebRequest)WebRequest.Create("http://www.microsoft.com");

request.MaximumAutomaticRedirections = 4;
request.MaximumResponseHeadersLength = 4; // in kilobytes

HttpWebResponse response = (HttpWebResponse)request.GetResponse();

Console.WriteLine("Content length is {0}", response.ContentLength);
Console.WriteLine("Content type is {0}", response.ContentType);

Stream receiveStream = response.GetResponseStream();
StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF8);

string tmp = readStream.ReadLine(); // reads only the first line of the page

HTMLDocument htmlDoc = new HTMLDocumentClass();

// Does not compile: a string cannot be cast to HTMLDocument
htmlDoc = (HTMLDocument) tmp;

readStream.Close();
response.Close();
 
Thomas Peter

It now appears that I cannot use HttpWebRequest because I need to be able to
specify the Host header, and the Host entry in HttpWebRequest.Headers is set
by the system to the current host information with no way for me to modify it.

I need to retrieve web pages from the remote server to cache them. Any ideas?
 
Steven Cheng[MSFT]

Hi Thomas,

Thanks for your follow-up. Based on my experience, since you want to request
the page, retrieve its response stream and load it into an HTMLDocument
to process it, I think you can consider using the WebBrowser control for
the task. You can use the WebBrowser control to navigate to a web resource,
and when the page has loaded, it is automatically available as a Document
object.

Regards,

Steven Cheng
Microsoft Online Support

 
Thomas Peter

I can't use the WebBrowser control because the application must be a web
application.

I dropped the HttpWebRequest/HttpWebResponse approach and opted for MSXML2. But
does the ServerXMLHTTP open method support different ports?
MSXML2.ServerXMLHTTPClass();
 
Thomas Peter

Example of different websites sharing the same IP address:

microsoft.com and abc.com are both on server 207.71.34.12, so the request
requires a Host header to specify the desired site.
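
To make the role of the Host header concrete, here is a minimal sketch of what such a request looks like on the wire, sent over a plain TCP socket to the server's IP address. RawHttp, BuildGet and Fetch are hypothetical names for this illustration only. It is not a full HTTP client: it does not decode chunked responses, follow redirects, or handle non-ASCII bodies correctly.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

// Hypothetical helper, not a framework class.
public static class RawHttp
{
    // Builds a minimal HTTP/1.1 GET request that names the wanted
    // site in the Host header.
    public static string BuildGet(string hostName, string path)
    {
        return "GET " + path + " HTTP/1.1\r\n" +
               "Host: " + hostName + "\r\n" +
               "Connection: close\r\n\r\n";
    }

    // Connects to the server by IP address and port, sends the request,
    // and returns the raw response (status line, headers and body).
    public static string Fetch(string ip, int port, string hostName, string path)
    {
        using (TcpClient client = new TcpClient(ip, port))
        using (NetworkStream stream = client.GetStream())
        {
            byte[] request = Encoding.ASCII.GetBytes(BuildGet(hostName, path));
            stream.Write(request, 0, request.Length);
            using (StreamReader reader = new StreamReader(stream, Encoding.ASCII))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

With the example above, RawHttp.Fetch("207.71.34.12", 80, "abc.com", "/") would ask that server for the abc.com site, and passing "microsoft.com" instead would ask the same IP for the other site.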
 
Sunny

So,
are you saying that:

HttpWebRequest myReq =
(HttpWebRequest)WebRequest.Create("http://microsoft.com/");

and

HttpWebRequest myReq =
(HttpWebRequest)WebRequest.Create("http://abc.com/");

both create one and the same HttpWebRequest object, and you need to fix
the Host header?

In my tests the correct header is created, so I'm still wondering why
you cannot use HttpWebRequest for your task.

In the past I created a very basic web spider which uses
HttpWebRequest, then creates an MSHTML document with the content
fetched, and I was then able to iterate over and download all links and
pictures.
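
For readers who only need the list of linked resources, the "iterate over links and pictures" step can be approximated without MSHTML. The sketch below swaps in a regular expression for the DOM approach described above, so it is an assumption-laden shortcut rather than the spider's actual method: it misses unquoted attributes and URLs generated by script. ResourceFinder and FindResources are hypothetical names, and the code assumes .NET 2.0 or later for generics and Uri.TryCreate.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not a framework class.
public static class ResourceFinder
{
    // Matches src="..." or href="..." with single or double quotes.
    private static readonly Regex UrlAttribute = new Regex(
        "(?:src|href)\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Returns the absolute http/https URLs referenced by the page,
    // resolved against the address the page was fetched from.
    public static List<Uri> FindResources(Uri pageUri, string html)
    {
        List<Uri> found = new List<Uri>();
        foreach (Match m in UrlAttribute.Matches(html))
        {
            Uri absolute;
            if (!Uri.TryCreate(pageUri, m.Groups[1].Value, out absolute))
                continue; // not a usable URL

            bool isHttp = absolute.Scheme == Uri.UriSchemeHttp ||
                          absolute.Scheme == Uri.UriSchemeHttps;
            if (isHttp && !found.Contains(absolute))
                found.Add(absolute);
        }
        return found;
    }
}
```

Each returned Uri can then be requested and saved in turn, which is what makes the cached copy include the pictures as well as the HTML.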


Sunny
 
Thomas Peter

Sunny,

I am saying that HttpWebRequest myReq =
(HttpWebRequest)WebRequest.Create("http://microsoft.com/");

works great if you have a domain name. What about

(HttpWebRequest)WebRequest.Create("http://207.71.134.23");

for microsoft.com and

(HttpWebRequest)WebRequest.Create("http://207.71.134.23");

for abc.com? It is quite common for multiple sites to share one IP address.
Usually, going through DNS, that is no problem, but I need to be able to
access a site directly.
In the example above, in order to get the correct site I must also supply
the microsoft.com or abc.com Host header value.

It appears that one cannot modify certain headers in HttpWebRequest, and Host
is one of them.

Be a hero and share your spider code ;0) I am working on something
similar...
 
Sunny

Hi Thomas,
(inline)

Thomas said:
> Sunny,
>
> I am saying that HttpWebRequest myReq =
> (HttpWebRequest)WebRequest.Create("http://microsoft.com/");
>
> works great if you have a domain name... what about
>
> (HttpWebRequest)WebRequest.Create("http://207.71.134.23");
>
> for microsoft.com and
>
> (HttpWebRequest)WebRequest.Create("http://207.71.134.23");
>
> for abc.com, quite common for multiple sites to be sharing 1 IP address,
> usually going thru DNS its no problem... but i need to be able to directly
> access a site...
> example above: in order for me to get the correct site i must also supply
> the microsoft.com host header value or abc.com host header value.

I was confused that you rejected HttpWebRequest based only
on the fact that you cannot modify the Host header. That's why I asked the
question :)
I cannot see a reason why you would want to do this. If you already
know what you want to put in that header, you just have to create the
right HttpWebRequest object. Or am I missing something?

Thomas said:
> It appears that one cannot modify certain Headers in HttpWebRequest and Host
> is one of them.

There are a lot of things in the framework which are made on purpose, and
a lot that are not :). But for that header especially, I do not see a reason
for it to be exposed, as I said before.

Thomas said:
> Be a hero and share your spider code ;0) i am working on something
> similar...

Unfortunately, I can share only a small part of the code. I'll post it
later.


Sunny
 
Thomas Peter

Sunny,

You got my attention (inline)

Sunny said:
> Hi Thomas,
> (inline)
>
> I was confused, that you rejected HttpWebRequest from using only based
> on the fact that you can not modify HOST header. That's why I asked the
> question :)
> I can not see a reason why you would like to do this. If you already
> know what you want to put in that header, you just have to create the
> right HttpWebRequest object. Or I'm missing something?

Am I missing something? How do I do this? I know what I want to put in that
header. I just want to specify the Host header value, but I don't think
that's possible.

Sunny said:
> There a lot of things in the framework which are made by a purpose, and
> a lot are not :). But especially for that header, I do not see a reason
> to be exposed as I said before.

Sunny said:
> Unfortunately, I can share only a small part of the code. I'll post it
> later.

I will do the same
 
Steven Cheng[MSFT]

Hi Thomas,

As Sunny has mentioned, when we request sites that are distinguished
via host header, we can just create the HttpWebRequest object from the
URL that contains the host name, and the server side can correctly
route the request according to the Host header derived from the URL.
In addition, as for the WebBrowser control, we can use it in a web
application. For example, we can create a WinForms control which uses the
WebBrowser control and then embed the WinForms control in a web page (IE
supports embedded WinForms controls which run on the client-side CLR).

Anyway, I think we can first have a look at Sunny's suggestion. Thanks.

Regards,

Steven Cheng
Microsoft Online Support

 
Sunny

Hi Thomas,

Please read inline:

Thomas said:
> Am i missing something? How do i do this? I know what i want to put in that
> header... i just want to specify the HOST HEADER value but i dont think
> thats possible

And from your previous post:

Thomas said:
> I am saying that HttpWebRequest myReq =
> (HttpWebRequest)WebRequest.Create("http://microsoft.com/");
>
> works great if you have a domain name... what about
>
> (HttpWebRequest)WebRequest.Create("http://207.71.134.23");
>
> for microsoft.com and
>
> (HttpWebRequest)WebRequest.Create("http://207.71.134.23");
>
> for abc.com

So, if you know the domain name (microsoft.com or abc.com) and want to
put it in the header, then why not just create

HttpWebRequest myReq =
(HttpWebRequest)WebRequest.Create("http://microsoft.com/");

This way the Host header will be set correctly.

That was my point: if you know what you want to change the Host
header to, i.e. you know the domain, you can easily just create a new
HttpWebRequest with that domain.

I do not understand why someone would want to do this (pseudocode):

HttpWebRequest myReq =
(HttpWebRequest)WebRequest.Create("http://207.71.134.23");

myReq.HostHeader = "microsoft.com"; // this does not work


Why not create the web request directly against the known domain?

Sunny
 
Sunny

Hmm, something happened with the attachment.
I copy/pasted it here; watch for line wraps.

public void GetItem()
{
    if (this.link.IsImage)
        this.GetImage();
    else
        this.GetPage();
}

private void GetImage()
{
    System.Net.WebClient source = new System.Net.WebClient();
    Stream myData = null;
    FileStream myFile = null;
    FileInfo filename = new FileInfo(@"c:\myworkfolder" + @"\" + this.link.Subs);
    bool saved = false;

    try
    {
        byte[] buffer = new byte[4096];

        myData = source.OpenRead(this.link.Orig);
        myFile = new FileStream(filename.FullName, FileMode.Create);

        // Copy the image to disk in 4 KB chunks
        int br;
        do
        {
            br = myData.Read(buffer, 0, buffer.Length);
            if (br > 0)
                myFile.Write(buffer, 0, br);
        }
        while (br > 0);

        saved = true;
        this.link.IsRead = true;
    }
    finally
    {
        if (myData != null)
            myData.Close();
        if (myFile != null)
            myFile.Close();

        // Assumed intent: remove a partially written file only when the
        // download failed (the pasted code deleted the file in every case)
        if (!saved && File.Exists(filename.FullName))
        {
            try { File.Delete(filename.FullName); }
            catch { }
        }
    }
}

private void GetPage()
{
    System.Net.WebClient source = new System.Net.WebClient();
    StreamReader mr = null;
    string sWebPage = String.Empty;

    try
    {
        mr = new StreamReader(source.OpenRead(this.link.Orig));
        sWebPage = mr.ReadToEnd();
    }
    finally
    {
        if (mr != null)
            mr.Close();
    }

    HTMLDocumentClass myDoc;

    try
    {
        // Load the downloaded markup into an MSHTML document
        object[] oPageText = {sWebPage};
        myDoc = new HTMLDocumentClass();
        IHTMLDocument2 oMyDoc = (IHTMLDocument2)myDoc;
        oMyDoc.write(oPageText);
    }
    catch
    {
        // page is not well formatted, skip it
        return;
    }

    // if we are here, we have read the page and we are ready to parse it

    IHTMLElementCollection cMyLinks = (IHTMLElementCollection)myDoc.links;

    // The SubstituteTags method changes the href to the filename
    // in which we'll save the link, so the page is ready for offline viewing,
    // and it also adds the link to the queue of pages to be processed
    foreach (IHTMLAnchorElement oLink in cMyLinks)
        oLink.href = this.SubstituteTags(true, this.link.Orig, oLink.href, false);

    cMyLinks = (IHTMLElementCollection)myDoc.images;
    foreach (IHTMLImgElement oImage in cMyLinks)
        oImage.src = this.SubstituteTags(false, this.link.Orig, oImage.href, false);

    StreamWriter myFile = null;
    sWebPage = myDoc.documentElement.outerHTML;
    this.link.IsRead = true;
    try
    {
        myFile = new StreamWriter(oParent.WriteFolder + @"\" + this.link.Subs, false);
        myFile.Write(sWebPage);
    }
    finally
    {
        if (myFile != null)
            myFile.Close();
    }
}
 
Joerg Jooss

Thomas said:
> Sunny,
>
> I am saying that HttpWebRequest myReq =
> (HttpWebRequest)WebRequest.Create("http://microsoft.com/");
>
> works great if you have a domain name... what about
>
> (HttpWebRequest)WebRequest.Create("http://207.71.134.23");
>
> for microsoft.com and
>
> (HttpWebRequest)WebRequest.Create("http://207.71.134.23");
>
> for abc.com, quite common for multiple sites to be sharing 1 IP
> address, usually going thru DNS its no problem... but i need to be
> able to directly access a site...
> example above: in order for me to get the correct site i must also
> supply the microsoft.com host header value or abc.com host header
> value.
>
> It appears that one cannot modify certain Headers in HttpWebRequest
> and Host is one of them.
>
> Be a hero and share your spider code ;0) i am working on something
> similar...

Let's not confuse things here. A spider is just a special-purpose web
*client*. It does not relay requests like a proxy. I'm not sure what you're
trying to build: a true caching proxy, or simply some sort of spider or web
leech? If it's a proxy, you will need to be able to set "Host" independently
of the destination address, and HttpWebRequest won't work here. For a web
client that should never be the case, unless you've got some nasty user
who prefers to address multihomed servers by IP address and Host header ;-)
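
For readers on later framework versions: .NET Framework 4.0 added an HttpWebRequest.Host property that sets the Host header independently of the request URI, which covers the proxy-style case described above. It was not available in the framework versions current when this thread was written. A minimal sketch, assuming .NET Framework 4.0 or later; HostHeaderExample and CreateForVirtualHost are hypothetical names for this illustration:

```csharp
using System;
using System.Net;

// Hypothetical helper, not a framework class.
public static class HostHeaderExample
{
    // Builds a request addressed to an IP while asking for a named
    // virtual host. Requires HttpWebRequest.Host (.NET Framework 4.0+).
    public static HttpWebRequest CreateForVirtualHost(string ipUrl, string hostName)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(ipUrl);
        request.Host = hostName; // sent as the Host header instead of the IP
        return request;
    }
}
```

Calling CreateForVirtualHost("http://207.71.134.23/", "microsoft.com") and then GetResponse() connects to the IP address but asks the server for the microsoft.com site.
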

Cheers,
 
Thomas Peter

Thanks Sunny,

I'll look over your code and will touch base with you shortly...

Thanks again, I am really excited about this now.

~Thomas
 
