Saving a web page

alexey_r · Jul 3, 2006

Using HttpWebRequest and HttpWebResponse to retrieve a webpage seems
clear enough.

But unless I am missing something, this will only give me the HTML
source of the requested webpage, and not the images, stylesheets and
so on. Is there a simple way to get the entire webpage?

The alternatives I see now:
- Get a WebBrowser in the background to do it, but this seems very nasty.
  There _has_ to be a better way. Besides, how can I select the correct
  file type and enter the name in the background?
- Interop with mshtml.dll. See above.
- After getting the HTML file, I could iterate through the images, etc.
  to request all of them separately.

Thank you in advance!

Andy · Jul 3, 2006

You'll have to get the img tags and download the images manually; basically,
write code that does what a browser would normally do.

So, parse the <img> tags (and <a> tags, if you like), then use
HttpWebRequest to get the images.

HTH
Andy

Tom Spink · Jul 3, 2006

Hi,

Unfortunately, there isn't a simple way. Web browsers (usually) start
rendering the page and download the images, stylesheets and so on as
they need them: while parsing the HTML, they find an <img> tag or a
<link> tag and download the file that the tag references.

You'll need to do this; i.e. analyse the HTML you've received, and decide
what needs to be downloaded by looking at the tags.
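
A minimal sketch of that approach, under some loud assumptions: the class and method names (PageSaver, ExtractResourceUrls, SavePage) are made up for illustration, the regular expression is a crude stand-in for a real HTML parser, and WebClient is used for brevity in place of HttpWebRequest. It only looks at src on <img>/<script> tags and href on <link> tags.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

static class PageSaver
{
    // Crude pattern: src="..." on <img>/<script>, href="..." on <link>.
    // Group 1 holds a src value, group 2 an href value.
    static readonly Regex ResourcePattern = new Regex(
        @"<(?:img|script)\b[^>]*?\bsrc\s*=\s*[""']([^""']+)[""']" +
        @"|<link\b[^>]*?\bhref\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the absolute URLs of the resources the page references,
    // in document order and without duplicates.
    public static List<Uri> ExtractResourceUrls(string html, Uri baseUri)
    {
        List<Uri> result = new List<Uri>();
        foreach (Match m in ResourcePattern.Matches(html))
        {
            string raw = m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
            Uri absolute;
            // Resolve relative references against the page URL.
            if (Uri.TryCreate(baseUri, raw, out absolute) && !result.Contains(absolute))
                result.Add(absolute);
        }
        return result;
    }

    // Downloads the page and every referenced resource into one folder.
    public static void SavePage(string url, string folder)
    {
        Uri pageUri = new Uri(url);
        Directory.CreateDirectory(folder);
        using (WebClient client = new WebClient())
        {
            string html = client.DownloadString(pageUri);
            File.WriteAllText(Path.Combine(folder, "index.html"), html);
            foreach (Uri resource in ExtractResourceUrls(html, pageUri))
            {
                string name = Path.GetFileName(resource.LocalPath);
                if (name.Length == 0) continue; // URL has no file name part
                client.DownloadFile(resource, Path.Combine(folder, name));
            }
        }
    }
}
```

Note that the saved index.html still points at the original URLs; rewriting the links to the local copies, handling file name collisions, and following images referenced from inside stylesheets are left out of this sketch.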

Michael Nemtsev · Jul 3, 2006

Hello (e-mail address removed),

I'd save the page as MHT (web archive) and then parse it to get the images.
BTW, the images are encoded inside the MHT file.

PS: This lib could be used for parsing http://www.codeproject.com/csharp/mime_project.asp

---
WBR,
Michael Nemtsev :: blog: http://spaces.msn.com/laflour

"At times one remains faithful to a cause only because its opponents do not
cease to be insipid." (c) Friedrich Nietzsche

alexey_r · Jul 4, 2006

Michael said:
Hello (e-mail address removed),

I'd save page into MHT (web archive) and then parse it to get images
BTW images are encoded in the MHT

Ah, thank you. But how do I save it as MHT?

alexey_r · Jul 4, 2006

Tom said:
Hi,

Unfortunately, there isn't a simple way. The way web-browsers (usually)
work is that they start rendering the page, and download the
images/stylesheets/whatnot as they need them. They're parsing the HTML,
finding an <img> tag, or a <link> tag and deciding to download the file
that the tag is referencing.

You'll need to do this; i.e. analyse the HTML you've received, and decide
what needs to be downloaded by looking at the tags.

Thank you.

Michael Nemtsev · Jul 4, 2006

Hello (e-mail address removed),

http://groups.google.com/groups/search?q=dotnet+save+mht

alexey_r · Jul 4, 2006

Michael said:
Hello (e-mail address removed),

http://groups.google.com/groups/search?q=dotnet+save+mht

Thank you again! Looks like it won't work for websites protected by
password, so I am back to plan A.
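
For the record, a rough sketch of how the manual route might carry a login along, assuming the site uses HTTP authentication (Basic, Digest or similar). The names AuthenticatedFetch, CreateRequest and DownloadString are hypothetical, as are the user name and password parameters. A site with a form-based login would instead need the login form posted first, with the same CookieContainer reused for every later request.

```csharp
using System;
using System.IO;
using System.Net;

static class AuthenticatedFetch
{
    // Builds a request that carries credentials and shares a cookie
    // container, so session cookies set by one response are sent with
    // the next request that uses the same container.
    public static HttpWebRequest CreateRequest(
        string url, string user, string password, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Credentials = new NetworkCredential(user, password);
        request.CookieContainer = cookies;
        return request;
    }

    // Sends the request and returns the response body as text.
    public static string DownloadString(HttpWebRequest request)
    {
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The same request setup would then be used for the page itself and for each image or stylesheet it references.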

Michael Nemtsev · Jul 4, 2006

Hello (e-mail address removed),

What do you mean by "websites protected by password"?
Do you have an example?
Have you tried saving those sites to MHT via IE?
