desktop search engine to search several .pdf files at once?

  • Thread starter Achim Nolcken Lohse

Achim Nolcken Lohse

I need a desktop search engine that will run under Win98SE and can do
word searches (preferably boolean) on several .pdf files at once.

I've got a 600MB CD with 45 pdf files in a dozen nested folders that
I need to search, and am looking for a more efficient method than
opening each file separately and searching it.

Is there any freeware that can do this? Ideally, it should have the
ability to produce a text file with the locations of the hits, and
preferably, some of the context.




Achim



axethetax
 

Sietse Fliege

Achim said:
I need a desktop search engine that will run under Win98SE and can do
word searches (preferably boolean) on several .pdf files at once.

I've got a 600MB CD with 45 pdf files in a dozen nested folders that
I need to search, and am looking for a more efficient method than
opening each file separately and searching it.

Is there any freeware that can do this? Ideally, it should have the
ability to produce a text file with the locations of the hits, and
preferably, some of the context.

InfoRapid Search & Replace

http://www.inforapid.com/html/searchreplace.htm
http://www.inforapid.org/sr/sr.exe

See also these quotes from their pages:

Full text search in Html, Rtf, Pdf, WinWord and Excel files and
many other file formats

Before you can search PDF files with InfoRapid, you first have to
copy the freeware program pdftotext.exe, which is part of the XPDF
package, into the installation directory of InfoRapid Search & Replace.
Download a zip archive containing pdftotext.exe from
http://www.inforapid.org/se/xpdf.zip.

pdftotext.exe can convert only such PDF documents which are NOT
protected by a password. Password protected PDF files are therefore
searched and displayed as binary data. Many old PDF documents contain
LZW compressed text and pictures, which must be decompressed before the
search. In order to do so, pdftotext.exe needs the freeware program
gzip.exe, which must be also copied into the installation directory of
InfoRapid Search & Replace. gzip is available here: www.gzip.org
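
For readers who prefer to script the same idea, here is a minimal sketch of the approach described above: convert each PDF to text with pdftotext, then search the text and log the hits with some context. It is not how InfoRapid works internally. It assumes pdftotext is on the PATH and relies on its documented convention that "-" as the output file sends the text to stdout. The function names (find_pdfs, pdf_to_text, find_hits, write_report) and the tab-separated report layout are made up for illustration.

```python
import os
import re
import subprocess


def find_pdfs(root):
    """Yield the path of every .pdf file under root, including nested folders."""
    for folder, _subdirs, files in os.walk(root):
        for name in sorted(files):
            if name.lower().endswith(".pdf"):
                yield os.path.join(folder, name)


def pdf_to_text(pdf_path, converter="pdftotext"):
    """Run pdftotext on one file and return its text ("-" means stdout)."""
    result = subprocess.run([converter, pdf_path, "-"],
                            capture_output=True, check=False)
    # latin-1 never fails to decode; the real output encoding is an assumption.
    return result.stdout.decode("latin-1", errors="replace")


def find_hits(text, word, context=40):
    """Return (offset, snippet) pairs for case-insensitive matches of word."""
    hits = []
    for match in re.finditer(re.escape(word), text, flags=re.IGNORECASE):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        # Collapse line breaks so each hit fits on one report line.
        snippet = " ".join(text[start:end].split())
        hits.append((match.start(), snippet))
    return hits


def write_report(root, word, report_path, extract=pdf_to_text):
    """Search every PDF under root; write one line per hit; return hit count."""
    total = 0
    with open(report_path, "w", encoding="utf-8") as report:
        for pdf_path in find_pdfs(root):
            for offset, snippet in find_hits(extract(pdf_path), word):
                report.write(f"{pdf_path}\t{offset}\t{snippet}\n")
                total += 1
    return total
```

A call such as write_report("D:\\", "hunting", "hits.txt") would produce a text file listing file, character offset and context for each hit. Password-protected or image-only PDFs yield no usable text, so they simply produce no hits.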
 

Son Of Spy

Another Option:
Search PDF
SearchPDF will scan a selected directory branch for PDF files and search
each PDF for a given text string. It displays all matching and non-matching
PDF files found in two list boxes. You can then open any selected matching
PDF file from within the program. Like MakePDF, it uses AFPL Ghostscript to
do all the hard work. In order to use the program, you must have
Ghostscript installed. This program also uses the pstotext package, written
by Paul McJones and Andrew Birrell of Digital Equipment Corporation's
Systems Research Center (see the file pstotext.txt which is included in the
download zip file for licencing details).

http://www.lexacorp.com.pg/soft/searchpdf11.zip ~90Kb

Download GhostScript:
http://unc.dl.sourceforge.net/sourceforge/ghostscript/gs813w32.exe ~7.8 Mb

Cheers!

SOS

--

Some You Won't Find Anywhere Else...

http://www.sover.net/~wysiwygx/index.html
 

Achim Nolcken Lohse

I need a desktop search engine that will run under Win98SE and can do
word searches (preferably boolean) on several .pdf files at once.

Thanks Sietse, SOS. Have downloaded the files and instructions, and
will probably try the InfoRapid approach first, as it seems a bit
simpler. Will let you know how it goes.


Achim



axethetax
 

Achim Nolcken Lohse

I tried it, following all the instructions (there's not much to the
install), including putting pdftotext.exe in the serapid directory,
but no joy.

Inforapid recognized the presence of pdftotext (Acrobat Reader shown
in the external viewers drop down menu), and duly searched the pdf
file in the designated directory, but couldn't find a single hit. The
search was the simplest possible - "hunting" in a single pdf file,
case sensitivity was unchecked, and "whole word only" was also
unchecked. The file contains 9 instances of this word, but Inforapid
found none.

Worse - the program is not well-behaved. I initially set it to search
a folder containing 120 pdf files amounting to 64MB of material. The
display screen turned off during the search, and when I tried to
reactivate it by hitting a key on the keyboard, I got no response.
After trying some specific keys, the display came back, but was
shimmering, fuzzy, and showed three desktops side by side. Then the
system reset. I'm guessing the system ran out of memory.


Achim



axethetax
 

Sietse Fliege

Achim said:
I tried it, following all the instructions (there's not much to the
install), including putting pdftotext.exe in the serapid directory,
but no joy.

Yes, the install is quite straightforward.
The only thing might be that you did not mention gzip.exe.
This should also be put in the serapid directory, in case of old PDF
documents containing LZW compressed text/pictures.

Inforapid recognized the presence of pdftotext (Acrobat Reader shown
in the external viewers drop down menu), and duly searched the pdf
file in the designated directory,

Apparently you also have 'Use external converters' checked. :)

but couldn't find a single hit.
The search was the simplest possible - "hunting" in a single pdf file,
case sensitivity was unchecked, and "whole word only" was also
unchecked. The file contains 9 instances of this word, but Inforapid
found none.

Bummer. I can't really think of anything. :)

Worse - the program is not well-behaved. I initially set it to search
a folder containing 120 pdf files amounting to 64MB of material. The
display screen turned off during the search, and when I tried to
reactivate it by hitting a key on the keyboard, I got no response.
After trying some specific keys, the display came back, but was
shimmering, fuzzy, and showed three desktops side by side. Then the
system reset. I'm guessing the system ran out of memory.

I am puzzled. It behaves well on my system, XP, 2G, 256 MB.
You can let it build a cache which makes it fast.
I just did a search for the word 'windows' in 249 pdf's, totaling 231
MB, and the search finished in 5.50 sec.

Perhaps you first searched through pdf's on a CD, which might complicate
things, memory-wise.
Maybe you could try deleting the cache (seCache.tmp), then start a new
search through one pdf file on the hard disk.
You should at least be able to get correct results searching txt files.
The author, Ingo Straub, has occasionally answered a question in this
group. You might want to try and e-mail him.
 

Achim Nolcken Lohse

Yes, the install is quite straightforward.
The only thing might be that you did not mention gzip.exe.
This should also be put in the serapid directory, in case of old PDF
documents containing LZW compressed text/pictures.

Yes, but not likely the case here. These pdf files were only recently
created by a commercial outfit.

Just to be sure, I installed gzip. The instructions, unfortunately,
are not clear. Gzip wants to install itself in its own directory. I
put the directory in the seRapid folder. The inforapid instructions
say to simply copy the executable into the same folder as inforapid,
but which one? There are several. So for good measure, I then copied
gzip.exe and gunzip.exe from their folder into the seRapid folder.

It made no difference.

Apparently you also have 'Use external converters' checked. :)

Yes.


Bummer. I can't really think of anything. :)


I am puzzled. It behaves well on my system, XP, 2G, 256 MB.

Well, my box is in a different category (Pentium 75MHz, 128MB), but I
didn't see any system requirements listed on the site.

You can let it build a cache which makes it fast.
I just did a search for the word 'windows' in 249 pdf's, totaling 231
MB, and the search finished in 5.50 sec.

Perhaps you first searched through pdf's on a CD, which might complicate
things, memory-wise.

Yes, the pdfs were on a CD. So following your suggestion, I searched
them on the hard drive - same result.

Maybe you could try deleting the cache (seCache.tmp), then start a new
search through one pdf file on the hard disk.

Makes no difference either.

You should at least be able to get correct results searching txt files.

Ok. Text file searches worked.

And then I tried searching some other pdf files, and found that
InfoRapid works on some, not on others. So perhaps it depends on the
version of Acrobat used to create the files?

The program shows "memory used" in the progress bar before and after
search sessions. On my system, it starts at about 40%, in yellow, and
then turns red at about 60%. But maybe it's not monitoring while doing
the search, resulting in a system crash?

The author, Ingo Straub, has occasionally answered a question in this
group. You might want to try and e-mail him.

OK. I'll send him a copy of this post, and offer to send on a couple
of the non-searchable pdf files, if he's interested.



Achim



axethetax
 

Sietse Fliege

Achim said:
Just to be sure, I installed gzip. The instructions, unfortunately,
are not clear. Gzip wants to install itself in its own directory. I
put the directory in the seRapid folder. The inforapid instructions
say to simply copy the executable into the same folder as inforapid,
but which one? There are several. So for good measure, I then copied
gzip.exe and gunzip.exe from their folder into the seRapid folder.

I copied both as well. I believe only gzip.exe is required, though.

And then I tried searching some other pdf files, and found that
InfoRapid works on some, not on others. So perhaps it depends on the
version of Acrobat used to create the files?

You can also try the latest version (3.0) of pdftotext.exe.
http://www.foolabs.com/xpdf/
ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.00-win32.zip (1141565 bytes)

The program shows "memory used" in the progress bar before and after
search sessions. On my system, it starts at about 40%, in yellow, and
then turns red at about 60%. But maybe it's not monitoring while doing
the search, resulting in a system crash?

Does your system keep crashing with it?
 

Achim Nolcken Lohse

You can also try the latest version (3.0) of pdftotext.exe.
http://www.foolabs.com/xpdf/
ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.00-win32.zip (1141565 bytes)

Good idea. The one I got from the inforapid link was V2.02. But it
doesn't seem to make any difference.

Does your system keep crashing with it?

Only on the one large search on the CD so far. So after installing
xpdf 3.0, I tried searching on a large chunk of the CD again - about
385MB. The search dragged on for 28 minutes, the progress bar showed
it about 75% complete. It was working on the last folder of pdf files,
all of them maps.

Then the screen went black, and like the previous time, the keyboard
wouldn't bring the display back, but moving the mouse did. Again, the
desktop showed in triplicate, too shimmery and unfocused to be
readable. I managed to get a drop down menu, but couldn't read it. And
then the system reset again.

The only difference from the previous attempt was that I got one hit,
and two "sort ofs": on one, the context was garbage characters, and on
the other, only the file was shown, not the target word or its
context.

So no joy.

BTW - a 64MB search on my hard drive only took about a minute by
comparison. But they weren't the same files. I have a hunch the map
files are slowing things down.




Achim



axethetax
 

Sietse Fliege

Achim said:
Only on the one large search on the CD so far. So after installing
xpdf 3.0, I tried searching on a large chunk of the CD again - about
385MB. The search dragged on for 28 minutes, the progress bar showed
it about 75% complete. It was working on the last folder of pdf files,
all of them maps.

Then the screen went black, and like the previous time, the keyboard
wouldn't bring the display back, but moving the mouse did. Again, the
desktop showed in triplicate, too shimmery and unfocused to be
readable. I managed to get a drop down menu, but couldn't read it. And
then the system reset again.

The only difference from the previous attempt was that I got one hit,
and two "sort ofs": on one, the context was garbage characters, and on
the other, only the file was shown, not the target word or its
context.

So no joy.

BTW - a 64MB search on my hard drive only took about a minute by
comparison. But they weren't the same files. I have a hunch the map
files are slowing things down.

I read in pdftotext.txt in the latest xpdf distribution:

"BUGS
Some PDF files contain fonts whose encodings have been mangled beyond
recognition. There is no way (short of OCR) to extract text from these
files."

I guess that this might account for what happens w.r.t. the maps.
But with 'normal' pdf's results should be correct (provided that any
corrupt cache has been deleted).

Other than that : beats me.
InfoRapid also worked fine on my win95 box, P166, 72 MB.

I hope it'll work out in the end for you.
I like the program (which is also Pricelessware).
 

Achim Nolcken Lohse

On Wed, 18 Feb 2004 13:31:29 +0100, "Sietse Fliege" wrote:

....
I read in pdftotext.txt in the latest xpdf distribution:

"BUGS
Some PDF files contain fonts whose encodings have been mangled beyond
recognition. There is no way (short of OCR) to extract text from these
files."

I guess that this might account for what happens w.r.t. the maps.

Yes, it accounts also for one of the non-readable pdfs I tested, which
seems to be nothing but a huge image file.

But with 'normal' pdf's results should be correct (provided that any
corrupt cache has been deleted).

Yes. It's definitely not a cache problem, nor an OCR problem, because
in several of the test pdfs, AR's search engine had no trouble finding
the target words.

Other than that : beats me.
InfoRapid also worked fine on my win95 box, P166, 72 MB.

It gets worse. I just tried running it on my top machine, an AMD K6-2
500MHz with 384MB of RAM. I copied the whole SERapid folder onto a
Zip, sneakered it over to the AMD, copied it onto the D: drive, and
ran it.

It started up, but only showed the top two fields in the Search
Dialog window. So I was able to change the file type, but not the
directory to search, or the type of search, as these fields didn't
display.

I tried adding the path in front of the file type in the file type
field, but that did nothing. The only directory InfoRapid would search
is C:\!

I went through my files to see if Inforapid required special
installation, but couldn't find any references. I believe it did use
an installer, but didn't seem to do much except copy everything into
its directory under \Program Files, and add an icon to the startup
menu.

In any case, it's not obvious why it would search only C:\ when it's
installed on D:\.

(This is why I have 20 untried freeware programs sitting on my hard
drive for every one that I've actually installed :/)



Achim



axethetax
 

Sietse Fliege

Achim said:
I went through my files to see if Inforapid required special
installation, but couldn't find any references.

As it also ran fine on my Win95 box and no special requirements are
mentioned, I'd assume that there are none.
But your problems put that somewhat in doubt.
It looks like seRapid is written in Visual C++ v7.0.
AFAIK that does not necessarily mean seRapid also depends on e.g.
msvcp70.dll. I might be wrong but it looks like seRapid "only" needs
e.g. msvcp60.dll. I hope the author will help you out, here.

I believe it did use an installer, but didn't seem to do much except
copy everything into its directory under \Program Files, and add an
icon to the startup menu.

I did not monitor the install, but had a look in its INSTALL.LOG.
It suggests that for a completely proper installation, setup is required
rather than just copying files.
It looks like in the latter case at least SEStart.dll and non-critical
functions, like contextmenu and printing search results might not be
properly registered.

In any case, it's not obvious why it would search only C:\ when it's
installed on D:\.

(This is why I have 20 untried freeware programs sitting on my hard
drive for every one that I've actually installed :/)

I'm out of my depth and hope that the author will help.
 

Achim Nolcken Lohse

As it also ran fine on my Win95 box and no special requirements are
mentioned, I'd assume that there are none.
But your problems put that somewhat in doubt.
It looks like seRapid is written in Visual C++ v7.0.
AFAIK that does not necessarily mean seRapid also depends on e.g.
msvcp70.dll. I might be wrong but it looks like seRapid "only" needs
e.g. msvcp60.dll. I hope the author will help you out, here.


I did not monitor the install, but had a look in its INSTALL.LOG.
It suggests that for a completely proper installation, setup is required
rather than just copying files.
It looks like in the latter case at least SEStart.dll and non-critical
functions, like contextmenu and printing search results might not be
properly registered.

Thanks, I should have looked it up myself - I use Quarterdeck
Cleansweep to monitor my installs. I'll try doing a full install and
see if it makes a difference to the way the program runs.

I'm out of my depth and hope that the author will help.

I may have to e-mail him again, because I got a cryptic message from
the ISP that handles my incoming mail, telling me a post of mine had been
rejected due to excessive length. But it gave no hint as to what the
message was or who the addressee was, and the date stamp didn't correspond
to anything I have in my logs (I use one ISP for sending mail and
another for receiving, which complicates things a bit). Strangely, I
got no error message from the sending ISP.

regards,

Achim


Achim



axethetax
 

Sietse Fliege

Achim said:
I may have to e-mail him again, because I got a cryptic message from
the ISP that handles my incoming mail, telling me a post of mine had been
rejected due to excessive length. But it gave no hint as to what the
message was or who the addressee was, and the date stamp didn't correspond
to anything I have in my logs (I use one ISP for sending mail and
another for receiving, which complicates things a bit). Strangely, I
got no error message from the sending ISP.

FWIW: I happen to have e-mailed him about a month ago (about a new
program that he released in German language only.)
Got an answer within two days.
 

Susan Bugher

Achim said:
BTW - a 64MB search on my hard drive only took about a minute by
comparison. But they weren't the same files. I have a hunch the map
files are slowing things down.

FWIW I have a hunch you may be right. My impression is that InfoRapid
S&R is much slower with large files. Replacing a lot of text items in
one moderately large file is a slow process on my machine (Win98 PIII
128 MB RAM). Great program though IMO. :)

Susan
 

Ingo Straub

Hello Achim,

Sorry, but I haven't received your e-mail. I think my mailbox is
limited to 5 MB. Please don't send me such large e-mails without asking
me first.

Have you tried copying the PDF files to your hard disk and searching them
there? If that helps, then it's a problem with Windows 98 and your
CD-ROM driver. I have seen similar problems with other programs. It
occurs when the PC is busy calculating some results and doesn't
access the CD-ROM drive for some time. Then the CD stops
spinning, and the next time the PC tries to access the CD-ROM drive, a
blue screen appears and the system must be rebooted.

If that doesn't help then you can try the following: Make sure that
the options "Use internal converters for HTML and RTF files" and "Use
external converters" on the tab CONVERTERS are checked. On the tab
SEARCH you have to leave the field SEARCH FOR empty. Press the Start
button. Then InfoRapid displays a list of all PDF files in your search
directory. When you click on one of the file names, then InfoRapid
shows the text which was returned by PDFTOTEXT, after it has converted
your PDF file into a text file. Every word which is contained in this
text file can be found by InfoRapid. If your SEARCH WORDS are not
contained in this text, then they will never be found.
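
The same check can be sketched outside InfoRapid: take whatever text the converter returned for one PDF and classify it before blaming the search itself. The snippet below is only an illustration. The names readable_ratio and diagnose are hypothetical, the 0.8 threshold is an arbitrary assumption, and counting only ASCII characters as readable would misjudge non-English text.

```python
import string

# Characters treated as "readable"; ASCII-only, which is an assumption.
_READABLE = set(string.ascii_letters + string.digits
                + string.punctuation + " \t\r\n")


def readable_ratio(text):
    """Fraction of characters that are ASCII letters, digits, punctuation or whitespace."""
    if not text:
        return 0.0
    return sum(ch in _READABLE for ch in text) / len(text)


def diagnose(text, word, min_ratio=0.8):
    """Classify converter output as 'empty', 'garbage', 'found' or 'not found'."""
    if not text.strip():
        return "empty"      # nothing extracted, e.g. an image-only PDF
    if readable_ratio(text) < min_ratio:
        return "garbage"    # text came back but is mostly unreadable
    return "found" if word.lower() in text.lower() else "not found"
```

An "empty" or "garbage" result points at the PDF or the conversion step, whereas "not found" on readable text means the word really is absent from what the converter produced.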

Regarding your second problem: You can resize the search dialog with
the tracker at the top of the search bar. Just track it upwards with
the left mouse button held down and you will find the missing input
fields.

Best Regards
Ingo
 

Achim Nolcken Lohse

Hello Achim,

Sorry, but I haven't received your e-mail. I think my mailbox is
limited to 5 MB. Please don't send me such large e-mails without asking
me first.

Sorry Ingo, I didn't realize how large the files were until too late.
I believe there were two at about 1.7MB, and one or two much smaller
ones.

Have you tried copying the PDF files to your hard disk and searching them
there?

Yes, it makes no difference to the ones that get no hits. Of course
the search is much faster.


.....
If that doesn't help then you can try the following: Make sure that
the options "Use internal converters for HTML and RTF files" and "Use
external converters" on the tab CONVERTERS are checked. On the tab
SEARCH you have to leave the field SEARCH FOR empty. Press the Start
button. Then InfoRapid displays a list of all PDF files in your search
directory. When you click on one of the file names, then InfoRapid
shows the text which was returned by PDFTOTEXT, after it has converted
your PDF file into a text file. Every word which is contained in this
text file can be found by InfoRapid. If your SEARCH WORDS are not
contained in this text, then they will never be found.

This approach displays a screen of garbage characters for the most
part. Some of the pdf files in the batch I'm trying to search can't be
searched because they're simple image files. They couldn't be searched
by AR's internal search mechanism either. However, there are others
which are searchable, but which InfoRapid/pdftotext can't decipher.
Unfortunately, they're mostly very large files of one Megabyte and up.
I'll try to find a smaller one.

Regarding your second problem: You can resize the search dialog with
the tracker at the top of the search bar. Just track it upwards with
the left mouse button held down and you will find the missing input
fields.

Yes, this worked, thanks. I did a search on my faster PC after trying
this, and managed to get the same single hit I did before. Then I
tried the same search with another word, and the system locked up
again.

Will contact you by e-mail when I have more meaningful information.





Achim



axethetax
 

Ingo Straub

Hello Achim,

The problem is that the files you can't search with InfoRapid are copy
protected. When you generate PDF documents, you can choose whether other
people should be allowed to copy the text or not. PDFTOTEXT respects
this flag and doesn't convert the PDF file into a text file.
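
A rough way to spot such files in advance is to look for an encryption dictionary. In the PDF format, protection settings such as the copy permission live in a dictionary that the file trailer references under the name /Encrypt, so a raw byte scan for that name gives a quick hint. This is a heuristic sketch only: the function name has_encrypt_entry is made up, the scan can give false positives (the bytes could appear elsewhere in the file), and it does not say which permissions are actually restricted.

```python
def has_encrypt_entry(pdf_path, chunk_size=1 << 20):
    """Return True if the raw bytes of the file contain the name /Encrypt."""
    marker = b"/Encrypt"
    tail = b""
    with open(pdf_path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if marker in window:
                return True
            # Keep the last few bytes so a marker split across two reads is
            # still detected on the next pass.
            tail = window[-(len(marker) - 1):]
```

Files for which this returns True are candidates for the "searchable in Acrobat Reader but not by pdftotext" behaviour described in this thread.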

Best Regards
Ingo
 
