Gather Documentation

TooOld

MS has some great webpage HTML documents that they do not provide in any
other form, and the technology they present has been discontinued. So I
expect that in the near future those webpages will go away, as many
others have. I want my docs; I paid for the apps the HTML documents
illuminate.
Also, I am often offline with no way to get online.
So... what might be the best way to get all of those multi-level
docs downloaded so I have them at hand? Some kind of webcrawler, or what is
recommended?
 
JJ

TooOld said:
MS has some great webpage HTML documents that they do not provide in any
other form, and the technology they present has been discontinued. So I
expect that in the near future those webpages will go away, as many
others have. I want my docs; I paid for the apps the HTML documents
illuminate.
Also, I am often offline with no way to get online.
So... what might be the best way to get all of those multi-level
docs downloaded so I have them at hand? Some kind of webcrawler, or what is
recommended?

You can use:

The Wayback Machine (http://wayback.archive.org/) to view cached pages of
any website, old or new, as long as that website allows it.

HTTrack software to download a whole website. Note: it won't work for
websites that serve their content using JavaScript.
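
If you end up with the command-line version of HTTrack, the same job can be
scripted. This is only a rough sketch in Python (the start URL, filter and
output folder are made-up placeholders, and it assumes the httrack executable
is on your PATH):

import subprocess

# Mirror one site into a local folder. The "+*.example.com/*" filter keeps
# the crawl from wandering off to other domains.
subprocess.run([
    "httrack",
    "http://www.example.com/docs/",   # placeholder start page
    "-O", "./example-mirror",         # output directory for the mirror
    "+*.example.com/*",               # only follow links on this domain
    "-v",                             # verbose progress output
], check=True)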
 
Paul

TooOld said:
MS has some great webpage HTML documents that they do not provide in any
other form, and the technology they present has been discontinued. So I
expect that in the near future those webpages will go away, as many
others have. I want my docs; I paid for the apps the HTML documents
illuminate.
Also, I am often offline with no way to get online.
So... what might be the best way to get all of those multi-level
docs downloaded so I have them at hand? Some kind of webcrawler, or what is
recommended?

You can use www.archive.org to look up old content.

The navigation features look like they switched to Flash
recently, so if the page is rendered really weird, try
enabling your Flash plugin.

As an example of content, here is Technet from the year 2001.

http://web.archive.org/web/20010202150000/http://www.microsoft.com/technet/default.asp

And if you have a URL like that, you can replace "20010202150000"
with a single asterisk character "*", and that will take
you back to the navigation page.

If you have the Adobe Flash plugin installed, the navigation page
will look like this.

http://imageshack.us/a/img826/4769/txmh.gif
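
If you'd rather not guess at the timestamp part of the URL by hand,
archive.org also has a small "availability" lookup that returns the closest
snapshot it holds for a page. A rough Python sketch (the page and date below
are just examples):

import json
import urllib.parse
import urllib.request

# Ask the Wayback Machine for the snapshot closest to a given date.
page = "http://www.microsoft.com/technet/default.asp"
when = "20010202"   # YYYYMMDD you would like the snapshot to be near

query = urllib.parse.urlencode({"url": page, "timestamp": when})
with urllib.request.urlopen("http://archive.org/wayback/available?" + query) as resp:
    data = json.load(resp)

closest = data.get("archived_snapshots", {}).get("closest")
if closest:
    print("Archived copy:", closest["url"])   # a web.archive.org/web/... URL
else:
    print("No snapshot found for", page)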

*******

In some web browsers, you can do "Save As" "Entire web page", and the
something.html file is stored on your disk, as well as a folder
called "something", which will contain all the graphics files
and so on. The two items (file and folder) constitute a complete
copy of the page. However, the original URL might have been
http://www.somesite.com/level1/level2/something.html and later,
you'll be left guessing where that "something.html" in your
download folder came from. Doing it this way is far from perfect.
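
One way around the "where did this come from" problem is to fetch the page
yourself and keep the original URL in a note right next to it. A minimal
Python sketch (the URL and filenames are invented, and it only grabs the
single HTML file, not the graphics):

import urllib.request

url = "http://www.somesite.com/level1/level2/something.html"  # invented example
html_name = "something.html"

# Download the page itself.
urllib.request.urlretrieve(url, html_name)

# Keep a sidecar note so you can always tell where the file came from.
with open(html_name + ".source.txt", "w") as note:
    note.write(url + "\n")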

If you wanted the entire site copied, that would take a tool
like the old WebWhacker. Some web sites have
anti-hammering protection and will prevent such a run from
completing, so there are no guarantees, for a large site,
that you will get the whole thing. This is an old tool, and
it is a wonder this page hasn't been deleted.

http://en.wikipedia.org/wiki/WebWhacker

The important thing to extract from that article is the terminology.
WebWhacker was referred to as an offline browser. It's also
known as "website-mirroring software". The tool "wget" gets
a mention there too, but it involves the command line unless you can
find a GUI to run it with.

http://en.wikipedia.org/wiki/Offline_browser

http://en.wikipedia.org/wiki/Wget
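
If you can live with the command line, wget can do the website-mirroring job
by itself. The sketch below is driven from Python, but it is really just a
wget run; the starting URL is a placeholder, and the --wait pause is there to
be polite to any anti-hammering protection:

import subprocess

# Recursive mirror: rewrite links for offline use, pull in page graphics/CSS,
# stay below the starting directory, and pause between requests.
subprocess.run([
    "wget",
    "--mirror",             # recursive download with timestamping
    "--convert-links",      # rewrite links so the copy works offline
    "--page-requisites",    # also fetch images, CSS, etc. each page needs
    "--no-parent",          # do not climb above the starting directory
    "--wait=1",             # one-second pause between requests
    "http://www.example.com/docs/",   # placeholder starting URL
], check=True)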

Paul
 
TooOld

JJ presented the following explanation:
You can use:

The Wayback Machine (http://wayback.archive.org/) to view cached pages of
any website, old or new, as long as that website allows it.

HTTrack software to download a whole website. Note: it won't work for
websites that serve their content using JavaScript.

I have used Wayback before. Unfortunately, Wayback sometimes doesn't
have the pages; I have run into that too many times.

I tried HTTrack and it works well. I think I have captured what I need
for now.

Question 1: I read in the docs that HTTrack can capture webpages as a
browser is used to "link" around the live website. That would be
really helpful, but I cannot figure out how to do it.
Any help in this area would be greatly appreciated.

Question 2: I see that in an early HTTrack attempt that I cancelled, the
"missing" pages in the mirror are redirected to the live site.
How do I stop the browser from going to the live site? I am using IE;
I tried Chrome, but Chrome does not play well.

Optionally, though, I wonder whether HTTrack would allow browsing the mirror
and downloading a missing webpage into the local HTTrack mirror.
That would also be very useful, since I could add the desired missing
webpages to the mirror by "linking" from the mirror to the live webpage.
Maybe pipe dreams, but what the heck.
 
