[help please!] Extract selected data from html files using keywords?

Huty

Hi,

Do you know a freeware (or not) that would enable me to extract some data in
one text file with selected keywords from about 100 html (or php3,..) files?
In fact, I have already tried a demo of this (paying) program
"ListExtractor1.1" ( http://listextractor.hypermart.net/ note: I am not
interested in email addresses at all). But it does not do the full job I
need. ;-(
I would like to keep in only one text file the first group of numbers that
is available between two keywords (so that it will keep only one result
because usually I have the same first keyword repeated a lot of time inside
my saved html pages).
For instance :

Inside the html file, there is the following text :
---
Number : 1258622
Sector : electric
Net : 8.2%
---

So I would like the program to search between the keyword "Number" and
"Sector" and keep the data "1258622" in a text file. Then add in the same
text file and same row the number "8.2%" by using keep data "8.2%" between
the keywords "Sector" and "Net".
Finally I would love that the program change of row for each html page
processed.

Many thanks in advance for any help, ;-)
Huty
 
omega

Huty said:
Do you know a freeware (or not) that would enable me to extract some data in
one text file with selected keywords from about 100 html (or php3,..) files? [...]
I would like to keep in only one text file the first group of numbers that
is available between two keywords (so that it will keep only one result
because usually I have the same first keyword repeated a lot of time inside
my saved html pages).
For instance :

Inside the html file, there is the following text :
---
Number : 1258622
Sector : electric
Net : 8.2%
---

So I would like the program to search between the keyword "Number" and
"Sector" and keep the data "1258622" in a text file. Then add in the same
text file and same row the number "8.2%" by using keep data "8.2%" between
the keywords "Sector" and "Net"

I reread your message a number of times, trying to fit your objective to a
kind of operation I have done many times: using a program that does a
"block replace" (BK ReplacEm, for example) on a large set of web pages, once
I have identified a workable pattern for the start and end blocks throughout
the set. I have done this to preserve specific sections of the pages and
eliminate the rest of the content. The original pages can be preserved, with
the new output pages kept separately, and the output pages can then all be
combined afterwards. While thinking this through, though, and especially
without being able to see the full content of your pages to look for
workable patterns, I concluded that I could not offer a solution along
these lines that felt solid or satisfactory.

Then I was struck by another approach entirely. I have repeatedly been
quite impressed by the uses I've seen discussed of the DOS Find command
(with its output redirected to a file).

I have to admit, I do not have much direct experience with the Find command
myself. If it were my project, I'd start by getting an idea of its abilities
with "Find /?", then look over examples I'd stored, and so on. I am
definitely not qualified to suggest the best way for you to use it here.

Here is what I _can_ suggest: ask specifically about using the Find command
as it relates to your project. Ask here first, and if no help comes forth,
ask the same question in a DOS batch newsgroup. If you get lucky, someone
in the batch newsgroup might well be willing to write you a .bat file
specific to what you need.

(And in the process, suggestions might also come up for programs similar to
DOS Find but with additional abilities.)

Huty said:
Finally I would love that the program change of row for each html page
processed.

I believe I have understood your message, with the exception of this last
sentence. Whatever typo befell it, I was unable to figure out its intended
meaning.
 
rir3760

Huty said:
Do you know a freeware (or not) that would enable me to extract
some data in one text file with selected keywords from about 100
html (or php3,..) files? [Snip]
I would like to keep in only one text file the first group of
numbers that is available between two keywords (so that it will
keep only one result because usually I have the same first keyword
repeated a lot of time inside my saved html pages).
For instance :

Inside the html file, there is the following text :
---
Number : 1258622
Sector : electric
Net : 8.2%
---

So I would like the program to search between the keyword "Number"
and "Sector" and keep the data "1258622" in a text file. Then add
in the same text file and same row the number "8.2%" by using keep
data "8.2%" between the keywords "Sector" and "Net".
[Snip]

This can be done with GNU AWK. The downside is that GNU AWK is a
powerful command-line tool that requires its users to actually RTFM,
or it won't work ;-)

Looking at your example, you could save the following as a text file
with the name (for example) 'C:\Scripts\Test.awk':

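# --- awk script ---
# Keep a sliding window of the last three input lines.
{
    Line1 = Line2
    Line2 = Line3
    Line3 = $0

    # The three lines must start with Number, Sector and Net, in that order.
    if (match(Line1, /Number : [0-9]+/) == 1 &&
        match(Line2, /Sector : [a-zA-Z]+/) == 1 &&
        match(Line3, /Net : [0-9]+(\.[0-9]+)?%/) == 1) {
        # match() sets RSTART and RLENGTH, so only the value itself is kept.
        match(Line1, /[0-9]+/)
        Number = substr(Line1, RSTART, RLENGTH)
        match(Line3, /[0-9]+(\.[0-9]+)?%/)
        Net = substr(Line3, RSTART, RLENGTH)
        print Number " - " Net
    }
}
# --- awk script ---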

After that, open a command prompt (DOS box), change to the folder
containing the files you want to process, and execute the command:

gawk -f C:\Scripts\Test.awk *.html

That command will process all the .html files in the current
directory and display the output in the command prompt window. If
you want to save the output to a file (for example Output.txt),
execute this command instead:

gawk -f C:\Scripts\Test.awk *.html>Output.txt
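
A possible variation, not part of the script above: the original question
asks for only the first result in each file, one row per file, while the
script above prints a row for every matching block of three lines. The
sketch below uses FNR to reset its state at the start of each file and
stops looking once that file's row has been printed. It assumes a POSIX
shell with an awk on the PATH. The function name first_row_per_file and the
variables number and printed are made up for illustration. Unlike the script
above, it does not require the keywords to start a line or to sit on
consecutive lines, which may or may not suit your pages.

```shell
# Hypothetical helper: prints one "number - net" row per input file.
# Uses only plain POSIX awk features, so gawk should accept it as well.
first_row_per_file() {
    awk '
    # Reset the per-file state on the first line of every input file.
    FNR == 1 { number = ""; printed = 0 }

    # Once a file has produced its row, ignore the rest of that file.
    printed { next }

    # Remember the first "Number : <digits>" found in this file.
    number == "" && match($0, /Number : [0-9]+/) {
        number = substr($0, RSTART, RLENGTH)
        sub(/^Number : /, "", number)
    }

    # The first line at or after that point holding "Net : <value>%"
    # completes the row for this file.
    number != "" && match($0, /Net : [0-9]+(\.[0-9]+)?%/) {
        net = substr($0, RSTART, RLENGTH)
        sub(/^Net : /, "", net)
        print number " - " net
        printed = 1
    }
    ' "$@"
}
```

It would be called as first_row_per_file *.html > Output.txt. From a
command prompt without such a shell, the awk program between the single
quotes can be saved to a file and run with gawk -f in the same way as
Test.awk.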

One version of GNU AWK 3.1.3 can be found here:
<http://unxutils.sourceforge.net/>
<http://unxutils.sourceforge.net/UnxUpdates.zip>

And the GNU AWK manual is located here:
<http://www.fsf.org/software/gawk/manual/gawk-3.1.1/html_node/index.html>

HTH
 
