REQ: CHM files

omega

[...]
I'm going to snip everything for this initial reply. That's in order to
focus more narrowly on describing my new uploads.


1. http://www.redshift.com/~omega/pw/pw2005nominations.zip (700k)

This is a direct httrack retrieval of the 2005 directory. I've made no
alterations, left it a hybrid of offline-online. That is, it contains
locally all the PW2005 program description pages, and similar content.
Then it also contains links that go online, to other parts of the PW
site.

I will be using the archive above during the next couple of weeks, as
a guide when meditating on my voting decisions. And maybe also as a
reference if I want to engage in whichever topics arise during the
post-voting discussion.

I think there would be a couple of other folks around, dialup users,
who might find it convenient to have the web archive contents of the
file above, stored locally on disk, for the period of the next couple
of weeks. I can keep it updated every 2 days or so, if it's the case
that anyone else has use for it.


2. http://www.redshift.com/~omega/pw/pw2003-04.zip (1.6mb)

This is an offline web archive of PL 2003 and PL 2004. They are separate
folders. I just zipped them together for convenience.

Susan: I was not up to starting over. On the two archives above, I just
took what I had already. The only change I made was to rename to your
suggestion. And, mainly, to fix the internal links. (As a workaround for
the problem that arose from getting the mixed file pairs after your bot
and mine had the odd result from their interaction.) These are offline
files. They also contain title-tags for the php generated pages, such
that all pages within would be conducive to providing unique entries
for CHM TOCs / HTML Indexes.

It might be that you are more interested in a more purely unaltered
mirror of the 2003 and 2004 directories? That's easy business for the
full site. It's when it concerns an offline archive for a particular part
of the site, then the end-form sought is a matter to dwell upon. I find
it confusing, from the perspective of using the file myself, to have the
PL links sometimes point online, and sometimes offline (thus my choice
on these to cut some headers and footers). OTOH, maybe those details
are unimportant for whichever other general purposes the archives would
serve.

Alternate plans, and uses, re archiving the PW 2003 and 2004 archives:
It's something to talk about. Just, as far as my doing that particular
project again immediately, I'd rather not. I'd prefer to take a nice long
rest first...


3. Your bot, and mine, and how they interact.

I appreciate the information you've provided, about how those pages get put
together. I'm pretty slow, as well as having zero experience or knowledge
with using a web server script. Although I do have some feeling for what
goes on, due to your explanations, in conjunction with my observing the
httrack logs, and the content of the files. (Also, I did today take
advantage, and retrieve full directory lists, when you opened it up by
removing index.html files from the roots).

I thought it might turn out that you could want to check out my bot
first-hand?

You, or someone interested in getting involved, to help with the general
project of making chapters of the PWH site available in offline forms.

If you (or someone who has interest in helping) installs Httrack, and then
uses the zip file below I've created, that will provide things set to go for
a full mirror of the site.

http://www.redshift.com/~omega/pw/HttrackProjects.zip (2k)

Most of Httrack's settings work out of the box, but there are several
specific settings that are pertinent to a good mirroring of the PW site.
Originally I was going to post those briefly here. But this post has gone
already into what I'm guessing is the 100+ line range, so it's time to give
a break to attention span.

Btw, I'm hoping that I haven't made a fatal error here, that this doesn't
inspire 200 lurkers to go tromp across your server. I'd thought it over,
thought that took a while, but my best estimate ended that it would be okay
to talk in specifics about sending bots upon the site...
 
Susan Bugher

omega said:
What would be easiest for me to do at this stage, it's to give every file
in the 2004 archive an html extension, then do an SR of .htm for html.

I'll have to make a similar "package" for the pages that go on the PL2005
CD - nice of you to do the trial run. :)

This is what I was kinda sorta thinking of doing (I've been mulling it
over somewhere in the back of what passes for my mind). . . When I save
a PHP file to my hard drive it's renamed from say:

PL2004MULTIMEDIA.php to PL2004MULTIMEDIA.php.htm

If I *remove* ".htm" from the end of the file name it works fine locally
(except for table sorting) and the *links* shown on the web pages work
for navigation. Sooooooo - the first thing I was going to try was
renaming the files *back* to just ".php" . . .

I assume ;) - that when HTTrack downloads a PHP page that has a table
that can be sorted it revises the table heading links (to point them to
the additional pages it creates). ISTM changing *just* the *first*
column header link might work. I've never used HTTrack so I'm guessing
the first header link is:

<TH><A HREF="PL2004ProgramIndex.html">Program</A></TH>

and suggesting that could be revised to:

<TH><A HREF="PL2004ProgramIndex.php">Program</A></TH>

I'll let you tell me where I've gone wrong. . . ;)

On a related note I managed to fumble and stumble :) my way through some
FTP revisions on the PWH site. I've uploaded the acf Treepad file and
your acf CHM files to a new location.

See: http://ftp.pricelesswarehome.org/
and: http://ftp.pricelesswarehome.org/Archives/

Apologies for doing the number of posts - periodically a little light
goes on and I think - oh, yeah - better post about that. . . ;)

Susan
 
Susan Bugher

omega said:
Susan Bugher:

[...]
I'm going to snip everything for this initial reply.

I'm doing likewise.

Thanks Karen, I've downloaded your files and started looking at specifics.

I'm in the process of downloading HTTrack now (gonna play). I
*appreciate* all the help. Enjoy your well-deserved nap. :)

Susan
 
omega

Susan Bugher said:
[...]
I assume ;) - that when HTTrack downloads a PHP page that has a table
that can be sorted it revises the table heading links (to point them to
the additional pages it creates).

Your assumption is accurate....

The end result, the table in the output pages:

<TH><A HREF="PL2004ProgramIndexCD-2.html">Program</A></TH>
<TH><A HREF="PL2004ProgramIndexCD-3.html">Category</A></TH>
<TH><A HREF="PL2004ProgramIndexCD-4.html">Author</A></TH>
<TH><A HREF="PL2004ProgramIndexCD-5.html">PW_CD</A></TH>
[...]

The Httrack retrieval logs:

<remote path>/PL2004ProgramIndexCD.php ->
<local drive>/PL2004ProgramIndexCD.html

<remote path>/PL2004ProgramIndexCD.php?sortby=Program ->
<local drive>/PL2004ProgramIndexCD-2.html

<remote path>/PL2004ProgramIndexCD.php?sortby=Category ->
<local drive>/PL2004ProgramIndexCD-3.html

<remote path>/PL2004ProgramIndexCD.php?sortby=Author ->
<local drive>/PL2004ProgramIndexCD-4.html
[...]

(In the local mirror for Sorted pages from the PWH site, there's a
pattern of a pair of pages with about the same content, but each linked
separately. The first is eg PL2004ProgramIndexCD.html -- the page that's
loaded when you click the general link for "Program Index." The second
one is eg PL2004ProgramIndexCD-2.html, the page that's loaded by clicking
the "Sort by Program" column.)

ISTM changing *just* the *first* column header link might work. I've never
used HTTrack so I'm guessing the first header link is:

<TH><A HREF="PL2004ProgramIndex.html">Program</A></TH>

and suggesting that could be revised to:

<TH><A HREF="PL2004ProgramIndex.php">Program</A></TH>

I'm not getting the context, for where you would need filenames renamed to
extension php? Are you up to re-explaining?

Btw, to do an Httrack retrieval of specifically the CD pages. It's first
choosing an appropriate Start Page URL.

http://www.pricelesswarehome.org/2004/CD2004PL.htm

It's next setting the filters:

SCAN RULES
-*
+http://www.pricelesswarehome.org/2004/*CD*
+http://www.pricelesswarehome.org/2004/*.gif
+http://www.pricelesswarehome.org/2004/*.jpg
+dvdsig.md5

That first filter, the kill-all, it's usually the best start. Then you look
for filename and directory patterns, pertinent to what you want. The end
result here is that it gives about eight html pages, mainly the sorted table
ones, and the images. Then the links pertaining to other items on the site,
those point online.

But I've meandered off. It's because I didn't grasp the context where it
would be wanted to rename files from html to php...(?)
 
omega

Susan Bugher said:
I'm in the process of downloading HTTrack now (gonna play). I
*appreciate* all the help. Enjoy your well-deserved nap. :)

Thanks! :)

How's my favorite pet bot doing for you? Have you worked out the commands
to tell him to go hunt wabbits?

Btw, I hope that you were adept enough to cope with the very sloppy pair
of text notes I'd put in the Httrack project zip thing.

The one named "readme.txt," I later realized it should've been named
"dont-readme." (The first sentence was fine. But then after that, it
was a somewhat incoherent - and unnecessary - rambling about httrack local
directory structure for its setting files on particular download projects.)

And the other text note, the elliptical outline of a few of the settings
that I viewed as relevant. These were all the settings that were already
applied in the pwhome.whtt/winprofile.ini, but I didn't make that clear...
 
Susan Bugher

omega said:
Susan Bugher:

I'm not getting the context, for where you would need filenames renamed to
extension php? Are you up to re-explaining?

Try two:

Changing file extensions is easy, changing the links within the web
pages is a PITA.

Option 1. Don't use HTTrack.

*Save* all the PL pages. Remove .html from the end of the PHP file names
that have a double extension -> change *.php.html back to *.php (the
original file name). The other files *.htm and Index.html don't need
*any* changes. The files will work fine locally. PHP tables *won't* be
sortable. Updating is easy, just save the new page(s).
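
A minimal sketch of the renaming step in Option 1, assuming Python 3.9 or later and that the saved pages all sit in one folder. The names `restore_php_names` and `saved_suffix` are made up for illustration; nothing here comes from the PWH site or from HTTrack. Pass ".htm" as the suffix if that is what the browser appended.

```python
from pathlib import Path

def restore_php_names(folder: str, saved_suffix: str = ".html") -> list[str]:
    """Rename e.g. PL2004MULTIMEDIA.php.html back to PL2004MULTIMEDIA.php."""
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        # Only touch files whose name ends in ".php" plus the added suffix.
        if path.is_file() and path.name.endswith(".php" + saved_suffix):
            target = path.with_name(path.name[: -len(saved_suffix)])
            if target.exists():
                continue  # never overwrite a file that is already there
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

Files such as Index.html or the plain *.htm pages are left alone, since they do not carry the double extension.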

Option 2. Use HTTrack to do the same thing for you *and* save additional
pages so that the PHP tables will sort.

That first filter, the kill-all, it's usually the best start.

That suggestion *alone* was worth its weight in gold. :)

Then you look for filename and directory patterns, pertinent to what you
want.

Yup, that took a few tries. PL2003 and PL2005 are easy - PL2003 uses
mostly .htm pages, PL2005 uses .php pages. PL2004 is tough because it
has some of each (I just listed the .htm pages I wanted). I didn't ask
for image files - they could be added to the lists.

One key step is going to the "Experts Only" option tab and choosing
"Original URL / Original URL" - so that the links *aren't* changed. FWIW
- this is what I came up with. . .

---------
PL2003 site: http://www.pricelesswarehome.org/2003/
SCAN RULES
-*
+http://www.pricelesswarehome.org/2003/*.php*
+http://www.pricelesswarehome.org/2003/*.htm*
----------
PL2004 site: http://www.pricelesswarehome.org/2004/
SCAN RULES
-*
+http://www.pricelesswarehome.org/2004/*.php*
+http://www.pricelesswarehome.org/2004/2004proceduresPL.htm
+http://www.pricelesswarehome.org/2004/PL2004nCategoryIndex.htm
+http://www.pricelesswarehome.org/2004/PL2004nBUSINESS-HOME.htm
+http://www.pricelesswarehome.org/2004/PL2004nDESKTOP.htm
+http://www.pricelesswarehome.org/2004/PL2004nFILEUTILITIES.htm
+http://www.pricelesswarehome.org/2004/PL2004nGRAPHICS.htm
+http://www.pricelesswarehome.org/2004/PL2004nINTERNET.htm
+http://www.pricelesswarehome.org/2004/PL2004nMULTIMEDIA.htm
+http://www.pricelesswarehome.org/2004/PL2004nORGANIZERS.htm
+http://www.pricelesswarehome.org/2004/PL2004nPROGRAMMING.htm
+http://www.pricelesswarehome.org/2004/PL2004nSECURITY.htm
+http://www.pricelesswarehome.org/2004/PL2004nSYSTEMUTILITIES.htm
+http://www.pricelesswarehome.org/2004/PL2004nTEXT.htm
+http://www.pricelesswarehome.org/2004/PL2004nWEBDESIGN.htm
---------
PL2005 site: http://www.pricelesswarehome.org/2005/
SCAN RULES
-*
+http://www.pricelesswarehome.org/2005/*.php*
--------
acf site: http://www.pricelesswarehome.org/acf/
SCAN RULES
-*
+http://www.pricelesswarehome.org/acf/Members.php*
+http://www.pricelesswarehome.org/acf/WareGlossary.php*
-----------

Note: Notice how easy it is to do PL2005. :)

+http://www.pricelesswarehome.org/2005/*.php*
fetches all the *sets* of pages (tables can be sorted).

+http://www.pricelesswarehome.org/2005/*.php
fetches only one of each (no sorting capability but fewer pages).
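
A small sketch of why the trailing asterisk matters, under a plain glob reading of the rule where "*" means "any run of characters". This is an assumption about how the match works, not HTTrack's actual matcher, and it assumes the query string counts as part of the URL being matched. The page name PL2005ProgramIndex.php is hypothetical.

```python
import re

def glob_match(pattern: str, url: str) -> bool:
    """Treat '*' as 'any run of characters'; everything else is literal."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url) is not None

BASE = "http://www.pricelesswarehome.org/2005/"
plain = BASE + "PL2005ProgramIndex.php"      # hypothetical page name
sorted_page = plain + "?sortby=Program"      # same page with a sort query

# "*.php*" accepts both URLs; "*.php" accepts only the one that ends in ".php".
print(glob_match(BASE + "*.php*", plain), glob_match(BASE + "*.php*", sorted_page))
print(glob_match(BASE + "*.php", plain), glob_match(BASE + "*.php", sorted_page))
```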

More later. Thanks again for the help. :)

Susan
 
Susan Bugher

omega said:
How's my favorite pet bot doing for you? Have you worked out the commands
to tell him to go hunt wabbits?

Btw, I hope that you were adept enough to cope with the very sloppy pair
of text notes I'd put in the Httrack project zip thing.

They were very helpful after I spent time with the app and had some
inkling of what you were talking about. I read the notes, played with
the app, read the notes, played with the app. . . ;)

Susan
 
Susan Bugher

Susan said:
acf site: http://www.pricelesswarehome.org/acf/
SCAN RULES
-*
+http://www.pricelesswarehome.org/acf/Members.php*
+http://www.pricelesswarehome.org/acf/WareGlossary.php*
-----------

Note: Notice how easy it is to do PL2005. :)

+http://www.pricelesswarehome.org/2005/*.php*
fetches all the *sets* of pages (tables can be sorted).

+http://www.pricelesswarehome.org/2005/*.php
fetches only one of each (no sorting capability but fewer pages).

More later. Thanks again for the help. :)

Hmmm, but today it's working differently. I got all the "sets" of pages
even though I omitted the asterisk. ISTM I did the same thing several
times yesterday and today and got varying results - thought it was me at
first, now I'm wondering. . . perhaps the setting needs to be changed
to a specific PHP version or maybe there's a bug in the app.

I found some problems in the "extra" pages of the first web page I
checked - the sorts were not correct in all of them (the online sort is
okay). Some search and replace work is needed in the table headings if
I use the original (local) links. Haven't figured a way around that yet.

Rats and phooey. . . :(

soooooooo. . . I made some .zip files that *don't* have sortable tables
and uploaded them:

http://ftp.pricelesswarehome.org/Archives/

The Pricelessware archives (2003, 2004, 2005) contain all pages - the
acf.zip archive has just a few web pages (more can be added by saving
the online pages).

This is a trial run (for the PW2005 CD web pages). If there are any
problems please let me know.

Susan
 
omega

Susan Bugher:

Susan, I just found a setting that you might (?) be interested in testing.

Options > Spider

"Check Document Type."

Change the drop-down from "if unknown except /" to "never."

| Define when the engine has to check document type
|
| The engine must know the document type, to rewrite the file types.
| For example, if a link called /cgi-bin/gen_image.cgi generates a gif
| image, the generated file will not be called "gen_image.cgi" but
| "gen_image.gif"

When I tested just now it wrote filename extensions .php to disk, and
then the local table links like this:

local:
<TH><A HREF="PL2004ProgramIndexCD-2.php?sortby=Program">Program</A></TH>
<TH><A HREF="PL2004ProgramIndexCD-3.php?sortby=Category">Category</A></TH>
<TH><A HREF="PL2004ProgramIndexCD-4.php?sortby=Author">Author</A></TH>
<TH><A HREF="PL2004ProgramIndexCD-5.php?sortby=PW_CD">PW_CD</A></TH>
<TH><A HREF="PL2004ProgramIndexCD-6.php?sortby=Ware_type">Ware_type</A></TH>
<TH><A HREF="PL2004ProgramIndexCD-7.php?sortby=DescRev">DescRev</A></TH>

At least that makes the local and the online URLs more parallel. If
that helps (?)

online:
<TH><A HREF="PL2004ProgramIndexCD.php?sortby=Program">Program</A></TH>
<TH><A HREF="PL2004ProgramIndexCD.php?sortby=Category">Category</A></TH>
<TH><A HREF="PL2004ProgramIndexCD.php?sortby=Author">Author</A></TH>
<TH><A HREF="PL2004ProgramIndexCD.php?sortby=PW_CD">PW_CD</A></TH>
<TH><A HREF="PL2004ProgramIndexCD.php?sortby=Ware_type">Ware_type</A></TH>
<TH><A HREF="PL2004ProgramIndexCD.php?sortby=DescRev">DescRev</A></TH>
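
The two lists differ only in the "-N" counter that sits before ".php" in the local names. A minimal sketch of a search-and-replace that maps a local href back to the online form, assuming Python and assuming no real page name ends in a hyphen followed by digits (such a name would be altered too). `local_to_online` is a hypothetical helper, not part of HTTrack.

```python
import re

# Matches the "-N" counter that appears just before ".php" in the local names.
_COUNTER = re.compile(r"-\d+(?=\.php(?:\?|$))")

def local_to_online(href: str) -> str:
    """Drop the numeric counter so the href matches the online URL."""
    return _COUNTER.sub("", href, count=1)

print(local_to_online("PL2004ProgramIndexCD-2.php?sortby=Program"))
# prints PL2004ProgramIndexCD.php?sortby=Program
```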
 
omega

Susan Bugher said:
Hmmm, but today it's working differently. I got all the "sets" of pages
even though I omitted the asterisk. ISTM I did the same thing several
times yesterday and today and got varying results - thought it was me at
first, now I'm wondering. . .

I took a look at that. I don't have a guess about your earlier results,
but I can say what strategy I go for. To get only the first php-generated
page, and not let Httrack pursue the other sorted-by pages, add a kill
filter against the query string:

-*
+http://www.pricelesswarehome.org/2005/*.php
-http://www.pricelesswarehome.org/2005/*sort*
 
omega

Susan Bugher said:
http://ftp.pricelesswarehome.org/Archives/

The Pricelessware archives (2003, 2004, 2005) contain all pages - the
acf.zip archive has just a few web pages (more can be added by saving
the online pages).

This is a trial run (for the PW2005 CD web pages). If there are any
problems please let me know.

IMO you might consider revision of the readme. It says that a user could
use their browser, do a save-as, and add that page into an offline archive.
Consider the fact that pages saved from the browser, they have absolute
URLs. Consequently, such pages would be orphans, linked to no other pages
in the local archive.
 
omega

Susan Bugher said:
I found some problems in the "extra" pages of the first web page I
checked - the sorts were not correct in all of them (the online sort is
okay).

When I click around on the pages in the PL2005 directory that Httrack
retrieved, the sorts seem fine. Do you have record on which pages (or
PWH URLs) you were seeing problem results?

The case where the sorts would not work offline at all...

That would be those in archives created during the experiment of having
"Original URL" written to disk. Lacking a script to give orders, these
links just cause loading one particular html page.

<TH><A HREF="PL2004ProgramIndexCD.php?sortby=Category">Category</A></TH>
<TH><A HREF="PL2004ProgramIndexCD.php?sortby=Author">Author</A></TH>
<TH><A HREF="PL2004ProgramIndexCD.php?sortby=PW_CD">PW_CD</A></TH>
 
omega

[ADD]
Susan, I just found a setting that you might (?) be interested in testing.

Options > Spider

"Check Document Type."

Change the drop-down from "if unknown except /" to "never."

| Define when the engine has to check document type
|
| The engine must know the document type, to rewrite the file types.
| For example, if a link called /cgi-bin/gen_image.cgi generates a gif
| image, the generated file will not be called "gen_image.cgi" but
| "gen_image.gif"

I forgot the other part. Also on the Spider tab: "Force Old HTTP/1.0
requests."

Neither of these settings is good as a normal default. But they seem fine
for this project. It's what gets the php filename extension that I thought
you wanted. And it also writes a string into the hrefs that more closely
resembles the online counterpart.
 
omega

[Edit, 3]

re, httrack, and getting filenames.php names, plus getting query strings
written into the hrefs:

I forgot the other part. Also on the Spider tab: "Force Old HTTP/1.0
requests."

Heck. There were three parts altogether. During my initial round of
testing, I'd made the mistake of not taking notes, and relying on my
sub-memory instead. Trodging up from the rear, here is the third of it:

Options > Mime Associations
File Types > // (remove php from here)

When you're changing settings like this, you're changing them only for
the particular project (the one denoted by the someproject.whtt file),
not making global changes. The settings above would not be good global
ones, since php scripts on other servers will be different, and you would
probably want filenames.html for other projects.

To save settings from one project, for use in another, you can export/
import an options (.opt) file. That's done under the Preferences menu:
Save Options As / Load Options. The Preferences Menu also provides
commands to set global options. Plus it has a get-outta-jail-free card,
to reset back to original defaults.
 
Susan Bugher

omega said:
[Edit, 3]

re, httrack, and getting filenames.php names, plus getting query strings
written into the hrefs:

I forgot the other part. Also on the Spider tab: "Force Old HTTP/1.0
requests."

Heck. There were three parts altogether. During my initial round of
testing, I'd made the mistake of not taking notes, and relying on my
sub-memory instead. Trodging up from the rear, here is the third of it:

Options > Mime Associations
File Types > // (remove php from here)

Thanks. :) One or more of those should do the trick.

To save settings from one project, for use in another, you can export/
import an options (.opt) file. That's done under the Preferences menu:
Save Options As / Load Options. The Preferences Menu also provides
commands to set global options. Plus it has a get-outta-jail-free card,
to reset back to original defaults.

Yup - figured that part out when I went back and forth between the app
and your notes.

Susan
 
Susan Bugher

Consider the fact that pages saved from the browser, they have absolute
URLs.

HUH ???????????????? Only if you *change* them. PWH site navigation
links are *relative*. The *only* "hard" link is the "Pricelessware Home"
link.

Susan
 
omega

Susan Bugher said:
Thank you again. :)

Btw, Httrack filters are processed sequentially. Therefore one might
have a filter that works fine like this:

-*
+somepath/*.jpg
-somepath/th*.jpg

But if those lines were in a different sequence, then it wouldn't work.
Eg it would get all those jpgs, and not exclude the thumbs, if the last
two lines were reversed.

Likely you'd already guessed the sequential nature of the filters
processing. Just wanted to be sure to mention, for bases covered.
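
A toy model of that ordering, assuming Python 3.9 or later. It walks the rules top to bottom and lets the last matching rule decide. This is a simplified sketch of the behaviour described above, not HTTrack's real filter engine, and thumb1.jpg is a made-up filename.

```python
import re

def glob_match(pattern: str, url: str) -> bool:
    """Treat '*' as 'any run of characters'; everything else is literal."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url) is not None

def allowed(rules: list[str], url: str) -> bool:
    verdict = False                      # assumption: nothing passes by default
    for rule in rules:                   # rules are read in sequence
        sign, pattern = rule[0], rule[1:]
        if glob_match(pattern, url):
            verdict = (sign == "+")      # a later match overrides an earlier one
    return verdict

good = ["-*", "+somepath/*.jpg", "-somepath/th*.jpg"]
swapped = ["-*", "-somepath/th*.jpg", "+somepath/*.jpg"]

print(allowed(good, "somepath/thumb1.jpg"))     # False: the thumb is excluded
print(allowed(swapped, "somepath/thumb1.jpg"))  # True: the later + rule wins
```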
 
omega

Susan Bugher said:
HUH ???????????????? Only if you *change* them. PWH site navigation
links are *relative*. The *only* "hard" link is the "Pricelessware Home"
link.

In the readme file, you said for a user to save a page from their
browser, and then put it in their local archive, right?

When they do that, then the browser will insert the whole path.
http:\\www.pricelesswarehome.org\...

The only time the browser writes relative paths is when it makes a folder
for images (and similar, like css and js). But at that time it invents its
own folder name for those.

Back to when the browser writes the whole path. If they then use that
locally saved-file to click on a link related to the PWH site, then it
will take them online. It will not link to relative pages in the archive.
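
What a given browser actually wrote can be checked on the saved file itself. A minimal sketch, assuming Python 3.9 or later, that lists the links in a page that carry a full http address. `absolute_links` is a hypothetical helper and PL2005nTEXT.php is a made-up page name used only for the sample.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def absolute_links(html_text: str) -> list[str]:
    parser = LinkCollector()
    parser.feed(html_text)
    return [h for h in parser.hrefs
            if h.lower().startswith(("http://", "https://"))]

sample = ('<A HREF="PL2005nTEXT.php">relative</A> '
          '<A HREF="http://www.pricelesswarehome.org/2005/PL2005nTEXT.php">full</A>')
print(absolute_links(sample))
```

An empty list for a saved page would mean its links stayed relative; anything listed would go online when clicked.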
 
Susan Bugher

omega said:
In the readme file, you said for a user to save a page from their
browser, and then put it in their local archive, right?

When they do that, then the browser will insert the whole path.
http:\\www.pricelesswarehome.org\...

Mozilla doesn't do that. What browser are you talking about?

Susan
 
