any freeware apps that'll generate an index of words on html pages?


dave

hoping to find a freeware app that'll 'read' (hundreds of) my web pages
(online or off, doesn't matter) and generate an index of ALL words used
on the site. mainly from (or entirely from) the item descriptions and/or
photo captions.

something like "word 'steel' used fifty-eight times, occurs on these
pages, a href-A, B, C, D etc etc". results organized by the app in ANY
'human useable' fashion would be just fine...(preferably alphabetical,
would help some.. :)

i realize it'd index ALL words, then I'd just edit out the ones I can't
use for my purposes...no big deal.

also, wife has microsoft word, the 'biggie version' - can *it* read html
pages and do that type indexing? can it 'root down thru' subdirectories
and do them all at the same time? how about if I converted all the html
pages to 'plain text' first - could it do it then? I'm not at all sure
what an app like this would even be called...an indexer?

thanks much for tips, clues, pointers, and ideas,

toolie
 

Susan Bugher

dave said:
hoping to find a freeware app that'll 'read' (hundreds of) my web pages
(online or off, doesn't matter) and generate an index of ALL words used
on the site. mainly from (or entirely from) the item descriptions and/or
photo captions.

something like "word 'steel' used fifty-eight times, occurs on these
pages, a href-A, B, C, D etc etc". results organized by the app in ANY
'human useable' fashion would be just fine...(preferably alphabetical,
would help some.. :)

i realize it'd index ALL words, then I'd just edit out the ones I can't
use for my purposes...no big deal.

Hi toolie, take a look at these two apps.

Program: TextStat
Author: Lionel Allorge
Ware: (Freeware) (open source)
http://www.lunerouge.org/

http://www.lunerouge.org/spip/article.php3?id_article=443

<q>
Analysis of a text:

This program reads text files and HTML files. If your document is in another format, you must
export it to plain text (text only) or HTML format.

You launch the TextStat program. In the zone "File to stat", you enter the name of your text file.
The button above allows you to choose a file. You can specify if the text is in HTML in which case
HTML tags are ignored.

In the zone "File for results", you enter the name of a file which will contain the whole
statistics. The button above allows you to choose a file. You can specify a file with the format
text or HTML.

You can then launch the statistics by clicking on button "TS". The processing can be long for a
large file. Once the process is finished, in the right-hand side, you will see the result of the
statistics. This result is also recorded in the "File for results" so that you can consult it in
another program.

Several options let you parameterize these statistics:

You can modify the list of the separators of words and sentences.

You can ask the program to ignore the difference between lowercase and capital letters, and between
accented and unaccented characters, by selecting the suitable boxes.

You can also indicate that the file uses the DOS (ASCII) character table instead of ANSI.

You can also ask for a search of repetitions of words. This aims to help avoid the use of the
same word in a short interval of text. For that, you must select the box and define the number of
words for the interval. You can also define a list of words to be ignored in this search; otherwise,
the result is likely to become unusable.
</q>
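
For anyone curious what those options amount to, here is a rough Python sketch. It is not TextStat's code; the function names, the default separator list, and the interval default are all made up for illustration. It shows a configurable separator list, optional case and accent folding, and a simple "same word again within N words" check.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop combining marks so that an accented letter compares equal to its base letter."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def split_words(text, separators=" \t\n.,;:!?()\"", ignore_case=True, ignore_accents=True):
    """Split text on a user-supplied (non-empty) separator list, after optional folding."""
    if ignore_case:
        text = text.lower()
    if ignore_accents:
        text = strip_accents(text)
    # Build a character class from the separators; re.escape keeps them literal.
    pattern = "[" + re.escape(separators) + "]+"
    return [w for w in re.split(pattern, text) if w]


def find_repetitions(words, interval=10, ignored=()):
    """Return (word, earlier_pos, later_pos) for words that recur within `interval` words."""
    last_seen = {}
    hits = []
    for pos, word in enumerate(words):
        if word in ignored:
            continue  # hypothetical ignore list, e.g. "the", "a", "and"
        if word in last_seen and pos - last_seen[word] <= interval:
            hits.append((word, last_seen[word], pos))
        last_seen[word] = pos
    return hits
```

Without an ignore list, common words like "the" will swamp the repetition results, which is presumably what the description above means by "unusable".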

-------------------

Program: TextSTAT (2)
Author: Matthias Hüning
Install: n.i.; n.r.
Ware: (Freeware) (open source)
http://www.niederlandistik.fu-berlin.de/textstat/

TextSTAT is a simple programme for the analysis of texts. It reads ASCII/ANSI texts (in different
encodings) and HTML files (directly from the internet) and it produces word frequency lists and
concordances from these files. This version includes a web-spider which reads as many pages as you
want from a particular website and puts them in a TextSTAT-corpus. The new news-reader puts news
messages in a TextSTAT-readable corpus file.
New in version 2.4: TextSTAT now reads MS Word and OpenOffice files. No conversion needed, just add
the files to your corpus...
In TextSTAT you can use regular expressions, which provide you with powerful search possibilities.
The programme is multilingual. Because it uses Unicode internally, TextSTAT can cope with many
different languages and file encodings. The user interface comes in three languages: English,
German, and Dutch.
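
If neither app quite fits, the kind of index toolie describes (each word, how many times it's used, and which pages it occurs on, in alphabetical order) can also be roughed out in Python with only the standard library. This is just a sketch under some assumptions: pages are local .htm/.html files readable as UTF-8, a "word" is a run of letters (with internal apostrophes or hyphens), and all the function names are invented for the example. It has nothing to do with how either program above works internally.

```python
import os
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def index_pages(pages):
    """pages: dict of page name -> HTML string. Returns word -> {page name: count}."""
    index = defaultdict(dict)
    word_re = r"[^\W\d_]+(?:['-][^\W\d_]+)*"  # letters, allowing internal ' or -
    for name, html in pages.items():
        for word in re.findall(word_re, html_to_text(html).lower()):
            index[word][name] = index[word].get(name, 0) + 1
    return index


def read_site(root):
    """Walk root and all its subdirectories, loading every .htm/.html file."""
    pages = {}
    for folder, _dirs, files in os.walk(root):
        for fname in files:
            if fname.lower().endswith((".htm", ".html")):
                path = os.path.join(folder, fname)
                with open(path, encoding="utf-8", errors="replace") as fh:
                    pages[os.path.relpath(path, root)] = fh.read()
    return pages


def format_index(index):
    """One alphabetical line per word: total count, then the pages it occurs on."""
    lines = []
    for word in sorted(index):
        hits = index[word]
        lines.append(f"{word}: {sum(hits.values())} ({', '.join(sorted(hits))})")
    return "\n".join(lines)
```

Something like `print(format_index(index_pages(read_site("mysite"))))` would then dump the whole index for a folder ("mysite" is a placeholder path), and the unwanted words can be edited out of the result by hand.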

Susan
--
Posted to alt.comp.freeware
Search alt.comp.freeware (or read it online):
http://www.google.com/advanced_group_search?q=+group:alt.comp.freeware
Pricelessware & ACF: http://www.pricelesswarehome.org
Pricelessware: http://www.pricelessware.org (not maintained)
 
