Gather Documentation

TooOld

MS has some great webpage HTML documents that they do not provide in any
other form, and the technology they present has been discontinued. So I
expect that in the near future those webpages will go away, as many
others have. I want my docs; I paid for the apps the HTML documents
illuminate.
Also, I am often offline with no way to get online.
So... what might be the best way to get all of those multi-level
docs downloaded so I have them at hand? Some kind of webcrawler, or what is
recommended?
 
JJ

TooOld said:
MS has some great webpage HTML documents that they do not provide in any
other form, and the technology they present has been discontinued. So I
expect that in the near future those webpages will go away, as many
others have. I want my docs; I paid for the apps the HTML documents
illuminate.
Also, I am often offline with no way to get online.
So... what might be the best way to get all of those multi-level
docs downloaded so I have them at hand? Some kind of webcrawler, or what is
recommended?

You can use:

The Wayback Machine (http://wayback.archive.org/) to view cached pages of
any website, old or new, as long as that website allows it.

HTTrack software to download a whole website. Note: it won't work for
websites that serve their content using JavaScript.
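
If you end up with the command-line version of HTTrack, the same job can be
scripted. This is only a rough sketch in Python (the start URL, filter and
output folder are made-up placeholders, and it assumes the httrack executable
is on your PATH):

import subprocess

# Mirror one site into a local folder. The "+*.example.com/*" filter keeps
# the crawl from wandering off to other domains.
subprocess.run([
    "httrack",
    "http://www.example.com/docs/",   # placeholder start page
    "-O", "./example-mirror",         # output directory for the mirror
    "+*.example.com/*",               # only follow links on this domain
    "-v",                             # verbose progress output
], check=True)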
 
Paul

TooOld said:
MS has some great webpage HTML documents that they do not provide in any
other form, and the technology they present has been discontinued. So I
expect that in the near future those webpages will go away, as many
others have. I want my docs; I paid for the apps the HTML documents
illuminate.
Also, I am often offline with no way to get online.
So... what might be the best way to get all of those multi-level
docs downloaded so I have them at hand? Some kind of webcrawler, or what is
recommended?

You can use www.archive.org to look up old content.

The navigation features look like they switched to Flash
recently, so if the page is rendered really weird, try
enabling your Flash plugin.

As an example of content, here is Technet from the year 2001.

http://web.archive.org/web/20010202150000/http://www.microsoft.com/technet/default.asp

And if you have a URL like that, you can replace "20010202150000"
with a single asterisk character "*", and that will take
you back to the navigation page.

If you have the Adobe Flash plugin installed, the navigation page
will look like this.

http://imageshack.us/a/img826/4769/txmh.gif
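
If you'd rather not guess at the timestamp part of the URL by hand,
archive.org also has a small "availability" lookup that returns the closest
snapshot it holds for a page. A rough Python sketch (the page and date below
are just examples):

import json
import urllib.parse
import urllib.request

# Ask the Wayback Machine for the snapshot closest to a given date.
page = "http://www.microsoft.com/technet/default.asp"
when = "20010202"   # YYYYMMDD you would like the snapshot to be near

query = urllib.parse.urlencode({"url": page, "timestamp": when})
with urllib.request.urlopen("http://archive.org/wayback/available?" + query) as resp:
    data = json.load(resp)

closest = data.get("archived_snapshots", {}).get("closest")
if closest:
    print("Archived copy:", closest["url"])   # a web.archive.org/web/... URL
else:
    print("No snapshot found for", page)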

*******

In some web browsers, you can do "Save As" "Entire web page", and the
something.html file is stored on your disk, as well as a folder
called "something", which will contain all the graphics files
and so on. The two items (file and folder) constitute a complete
copy of the page. However, the original URL might have been
http://www.somesite.com/level1/level2/something.html and later,
you'll be left guessing where that "something.html" in your
download folder came from. Doing it this way is far from perfect.
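
One way around the "where did this come from" problem is to fetch the page
yourself and keep the original URL in a note right next to it. A minimal
Python sketch (the URL and filenames are invented, and it only grabs the
single HTML file, not the graphics):

import urllib.request

url = "http://www.somesite.com/level1/level2/something.html"  # invented example
html_name = "something.html"

# Download the page itself.
urllib.request.urlretrieve(url, html_name)

# Keep a sidecar note so you can always tell where the file came from.
with open(html_name + ".source.txt", "w") as note:
    note.write(url + "\n")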

If you wanted the entire site copied, that would take a tool
like the old WebWhacker. Some web sites have
anti-hammering protection and will prevent such a run from
completing, so there are no guarantees, for a large site,
that you will get the whole thing. This is an old tool, and
it is a wonder this page hasn't been deleted.

http://en.wikipedia.org/wiki/WebWhacker

The important thing to extract from that article is the terminology.
WebWhacker was referred to as an offline browser. It's also
known as "website-mirroring software". The tool "wget" gets
a mention there too, but it involves the command line unless you can
find a GUI to run it with.

http://en.wikipedia.org/wiki/Offline_browser

http://en.wikipedia.org/wiki/Wget
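
If you can live with the command line, wget can do the website-mirroring job
by itself. The sketch below is driven from Python, but it is really just a
wget run; the starting URL is a placeholder, and the --wait pause is there to
be polite to any anti-hammering protection:

import subprocess

# Recursive mirror: rewrite links for offline use, pull in page graphics/CSS,
# stay below the starting directory, and pause between requests.
subprocess.run([
    "wget",
    "--mirror",             # recursive download with timestamping
    "--convert-links",      # rewrite links so the copy works offline
    "--page-requisites",    # also fetch images, CSS, etc. each page needs
    "--no-parent",          # do not climb above the starting directory
    "--wait=1",             # one-second pause between requests
    "http://www.example.com/docs/",   # placeholder starting URL
], check=True)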

Paul
 
TooOld

JJ presented the following explanation:
You can use:

The Wayback Machine (http://wayback.archive.org/) to view cached pages of
any website, old or new, as long as that website allows it.

HTTrack software to download a whole website. Note: it won't work for
websites that serve their content using JavaScript.

I have used Wayback before. Unfortunately, Wayback sometimes doesn't
have the pages; I have run into that too many times.

I tried HTTrack and it works well. I think I have captured what I need
for now.

Question 1: I read in the docs that HTTrack can capture webpages as a
browser is used to "link" around the live website. That would be
really helpful, but I cannot figure out how to do it.
Any help in this area would be greatly appreciated.

Question 2: I see that in an early HTTrack attempt that I cancelled, the
"missing" pages in the mirror are redirected to the live site.
How do I stop the browser from going to the live site? I am using IE;
I tried Chrome, but Chrome does not play well.

Optionally, though, I wonder whether HTTrack would allow browsing the mirror
and downloading a missing webpage into the local HTTrack mirror.
That would also be very useful, since I could add the desired missing
webpages to the mirror by "linking" from the mirror to the live webpage.
Maybe pipe dreams, but what the heck.
 
