Disaster with documents scanned into pdf. Please help

Wayne Fulton

"PDF is not designed to be able to get your data out of a PDF file for a
second try (at least Acrobat is not)."

So it seems you're saying that while Omnipage Pro can open a pdf file
and resave it at a smaller size, Adobe does not have this ability.

I just called both companies and their answers seem to confirm what you
said. Omnipage customer service told me I could open and re-save the
existing pdf at a smaller size. Adobe customer service wasn't sure and
they are getting back to me.


What they say should be interesting. Acrobat has a Paper Capture menu
that does OCR on the file contents and generates text characters
internally. There isn't any control; it simply does what it does. This
allows text searches on words in the document when it was originally an
image, and the file gets smaller.

Acrobat's internal OCR wants 300 dpi too. I compared files containing a
300 dpi scan and a 200 dpi scan of the same page. The 300 dpi image was
about double the size to start with, but after Capture its PDF was less
than half (nearly 1/3) the size of the 200 dpi PDF after Capture,
because the 200 dpi OCR did very poorly and the page mostly remained as
image.

But what I meant is that Acrobat provides no way to get your text
or your images back out of PDF. It isn't that it can't be done; Acrobat
just doesn't do it. I believe the thinking must be that we probably
created the PDF by printing our text source document to PDF in the first
place, so we can just use that original source document for other
purposes, and discard the unwanted PDF copy. The images in the PDF are
probably JPG anyway, and the source document probably wasn't.

People do use PDF to archive documents, and they can always be viewed or
printed, but Acrobat won't otherwise give back the actual text or images
(if you wanted to do something different with them now). I think that
not many understand that limitation.
 
lostinspace

----- Original Message -----
From: "Larry" <>
Newsgroups: comp.periphs.scanners
Sent: Tuesday, July 13, 2004 5:50 PM
Subject: Re: Disaster with documents scanned into pdf. Please help

"PDF is not designed to be able to get your data out of a PDF file for a
second try (at least Acrobat is not)."

So it seems you're saying that while Omnipage Pro can open a pdf file
and resave it at a smaller size, Adobe does not have this ability.

I just called both companies and their answers seem to confirm what you
said. Omnipage customer service told me I could open and re-save the
existing pdf at a smaller size. Adobe customer service wasn't sure and
they are getting back to me.

But what I'm thinking of doing is finding a computer service where you
can pay for the use of a computer by the hour that has the Omnipage
program on it, bring my pdf files there, and see if this will work.

Hello Wayne,
I'm not sure if you're aware of the full version of
Acrobat's save-as-TIF?
I previously advised the OP of this option when he created this 2nd thread
when disaster hit.

This reply is for your benefit and not the arse-OP

As an example:
Go to this link, download the PDF and save it locally.
http://www.thebigm.com/sharedimages/TrainerHistory.PDF
Open it in Acrobat.
Save as TIF. You'll end up with four individual page TIF's.
Open IrfanView, open each TIF individually and "save as TIF" (these are
replacements for the Acrobat-created TIF's; rename them if you choose).
Open your OCR software and scan the files rather than the images (this may
require a change in your scanner settings).

Perhaps for the thick-headed OP, "farming out" the job (as another previously
advised) is a better option. However, I'm willing to wager that the farm-out
is a waste of time and will result in the same bulk or else a smaller,
unusable bulk.

Additionally, Acrobat Capture is a piece of junk (at least in the 5.0 I
have).
 
lostinspace

Wayne,
A further comparison of the difference in types of TIF's may
be seen at the Library of Congress' American Memory archives.

Pick any article:
http://memory.loc.gov/
Make sure to view the image (at the AM site) at 100% rather than 50% or
text.
Download a page (TIF); the file will have the extension TIF3.

Attempt to OCR this page and you get squat.
Open IrfanView and save the page as a TIF and then OCR the page :)))
 
Wayne Fulton

Hello Wayne,
I'm not sure if you're aware of the full version of
Acrobat's save-as-TIF?

I was aware of it, but I wasn't willing to think of it as retrieval of the
original data, if that was the point.

Yes, Acrobat (I have 6) will save the PDF file pages as TIF or other file
formats, but the image of the page is more like a large full-page screen
capture. There may be configuration options somewhere, but my PDF pages with
only real text were coming out as a 1700x2199 pixel 200 dpi TIF image, while
pages that also had any little image on them were coming out as 1224x1583
pixels at 144 dpi. Both compute to 8.5x11 inch pages, but it didn't make sense
to me to downgrade the pages with images. There seemed to be no concept of
retrieval of original data (real text is now fax-grade image, 300 dpi images
are now 144 dpi images, etc). But still, it may be a useful option.
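
The page-size arithmetic in that paragraph is easy to check. This is only a rough sketch in Python; the function name is made up for illustration, and the pixel counts and dpi values are the ones quoted above.

```python
def page_size_inches(width_px, height_px, dpi):
    """Printed size in inches for a given pixel size and resolution."""
    return width_px / dpi, height_px / dpi

# Pixel dimensions and dpi as reported for the TIFs saved from Acrobat 6.
text_page = page_size_inches(1700, 2199, 200)    # pages with only real text
image_page = page_size_inches(1224, 1583, 144)   # pages that also held an image

for label, (w, h) in (("200 dpi", text_page), ("144 dpi", image_page)):
    print(f"{label}: {w:.2f} x {h:.2f} inches")
```

Both come out within a hundredth of an inch of 8.5x11, so the page size is preserved and only the resolution differs.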

I haven't looked at your linked files yet, but when I saved my own
TIF-from-PDF in IrfanView to another TIF, and opened that, I still had the
same thing, number wise... Seemed right to me, seems IrfanView ought to save
whatever it has, or I'd be upset with it <g>

So I don't get the point yet, but I will look at your files too.
 
lostinspace

" havent looked at your linked files yet, but when I saved my own
TIF-from-PDF in IrfanView to another TIF, and opened that, I still had the
same thing, number wise... Seemed right to me, seems IrfanView ought to save
whatever it has, or I'd be upset with it <g>

So I don't get the point yet, but I will look at your files too."

The primary point is that the TIF files that I have saved from within
Acrobat will not OCR worth beans, if at all.

I haven't found it necessary to go through these procedures for PDF's that I
create.
99% of my documents are OCR'd into WordPad and the PDF's are created from
Word.

I've used the aforementioned extreme measures extensively to recover and OCR
text from others' PDF's.
On my end, I find the TIF's created from IrfanView increase in overall file
size by at least five-fold as compared to those saved TIF's from Acrobat.
Even larger in some instances.
 
Wayne Fulton

The primary point is that the TIF files that I have saved from within
Acrobat will not OCR worth beans, if at all.

I haven't found it necessary to go through these procedures for PDF's that I
create.
99% of my documents are OCR'd into WordPad and the PDF's are created from
Word.

I've used the aforementioned extreme measures extensively to recover and OCR
text from others' PDF's.
On my end, I find the TIF's created from IrfanView increase in overall file
size by at least five-fold as compared to those saved TIF's from Acrobat.
Even larger in some instances.


Sorry, I don't see the problem. I didn't try OCR; 200 dpi is marginally low
for best OCR anyway, so I am willing to believe it isn't great. But my pair
of images are the same, both 1700x2199 pixels, scaled to 200 dpi at 8.5x11
inches, 24 bit, etc, no change at all. There is no reason they should be
different, and it would seem a problem if they were.

The file sizes are different only because Acrobat saved the TIF with LZW
compression, and IrfanView didn't. But you can select LZW when you save the
TIF in IrfanView if you want, in the separate box for TIF options at the
Save As dialog. The IrfanView menu Image - Information tells about the
compression used in existing files; one says LZW, one says None. File
compression (at least lossless compression like LZW) won't affect OCR, because
the file is uncompressed when it is opened in memory.
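
The point about lossless compression can be shown with a small round-trip test. This is only a sketch: Python's standard library has no LZW codec, so zlib (DEFLATE) stands in for LZW here, and the byte pattern is a made-up substitute for real page pixels, not data from any of the files discussed.

```python
import zlib

# Hypothetical stand-in for 1-bit page data: blank rows plus some "text" rows.
row_blank = b"\xff" * 300                                   # all-white row
row_text = b"\xff" * 20 + b"\x0f\xf0\x3c" * 80 + b"\xff" * 40  # 300 bytes too
pixels = (row_blank * 40 + row_text * 12) * 50

# Lossless compression changes the stored size, not the pixels.
compressed = zlib.compress(pixels)
restored = zlib.decompress(compressed)

print("stored uncompressed:", len(pixels), "bytes")
print("stored compressed:  ", len(compressed), "bytes")
print("pixels identical after decompression:", restored == pixels)
```

An OCR program only ever sees the decompressed pixels, so two TIFs that differ only in lossless compression should OCR the same.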
 
David R

I'm confused when you mention OCR and PDF. One is not the other. PDF
is a form of compressed image file.

Anyhow if you are just scanning text to PDF you should use Black &
White Bitmap and I usually use 300dpi.

TIP: When you are adding multiple scans to a PDF file, its compression
tends to be less efficient. For this reason it is a good idea to do a
Save As when you are finished. The new file often tends to be smaller
than the one you started with.
 
Larry

That's what I did: I scanned multiple pages of text using Black and
White at 300 dpi. Then when I stopped each scanning job and saved it as
a pdf file, the result, whether it was one page or 100 pages, averaged
at about 750 KB per page.

Larry
 
Wayne Fulton

That's what I did: I scanned multiple pages of text using Black and
White at 300 dpi. Then when I stopped each scanning job and saved it as
a pdf file, the result, whether it was one page or 100 pages, averaged
at about 750 KB per page.


Larry, I just tried a better PDF test and got surprising results, not
what I expected or remembered from previous versions.

I scanned one page of a stock 10K report. This was selected because it
was much like any regular book page, but it is a free public document.
It was 100% text, no images, so line art was very appropriate. The text
was not small, perhaps 12 pt, at least it was 6 lines per inch. The page
size was 8x10.5 inches, and I scanned 8x10 inches of it (excluded a bit
of blank margin). 8x10 inches at 300 dpi uncompressed line art computes
to 900KB (not in PDF).
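
For anyone who wants to check that number, the uncompressed size is just pixels times bits per pixel. A quick sketch in Python follows; the function name is hypothetical, and the grayscale figure is added only for comparison and assumes 8 bits per pixel.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw image size in bytes, before any compression or file overhead."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

line_art = uncompressed_bytes(8, 10, 300, 1)    # 1 bit per pixel
grayscale = uncompressed_bytes(8, 10, 300, 8)   # 8 bits per pixel (assumed)

print(f"line art:  {line_art:,} bytes")
print(f"grayscale: {grayscale:,} bytes")
```

Line art works out to 900,000 bytes for the 2400x3000 pixel page, and the same page in 8-bit grayscale is eight times that.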

Scanned with Acrobat 6 at 300 dpi line art mode (line art means all
pixels are pure black or pure white, there is no trace of gray tones in
line art), with Acrobat default compression, the one page PDF file was
16KB. Yes, 16KB, even at 300 dpi for printing. Line art compression is
almost infinitely effective on blank space, so mileage will vary with
each page of content.
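
Why blank space compresses so well can be illustrated roughly as follows. This is a sketch under stated assumptions: zlib stands in for whatever line art compression Acrobat actually applies (the thread does not say which), and the two "pages" are synthetic extremes, an entirely white page and pure random noise.

```python
import os
import zlib

PAGE_BYTES = 900_000  # 8x10 inches at 300 dpi, 1 bit per pixel

blank_page = b"\xff" * PAGE_BYTES      # every pixel white
noisy_page = os.urandom(PAGE_BYTES)    # worst case: no pattern at all

blank_size = len(zlib.compress(blank_page, 9))
noisy_size = len(zlib.compress(noisy_page, 9))

print(f"blank page: {blank_size:,} bytes compressed")
print(f"noisy page: {noisy_size:,} bytes compressed")
```

A real page of text falls somewhere between the two, which is why the compressed size varies with the content of each page.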

Then the same scan, followed by the Acrobat Capture option to create
text, came to 61KB. My experience with earlier versions of Acrobat was
that the file got much smaller after Capture, but this didn't happen today.

Same scanned into Acrobat 6 as 300 dpi grayscale was 206KB, decently
small considering, but too large to do many pages this way (and no point
of it if there are no images to be retained). Default Acrobat JPG
compression.

Scanned as 300 dpi grayscale into OmniPage Pro 12, recognized as a text
document (it did an excellent job automatically), and saved as a PDF
file, the one page PDF file was 78KB. (larger than I expected, but of
course vastly smaller than 750KB too).

The two files with text (OmniPage and Capture) were searchable, but the
others with images of course were not searchable, which is a bad thing in
a large PDF file. As to the size of text, my guess is that perhaps
adding text adds overhead, the total of which may be much less noticeable
for more than one page (but this was not tested).

If you wish, I can email you a couple of these sample files for your
inspection, so you can judge their size and clarity and see how they view
and print.
 
Wayne Fulton

I think the overhead theory may have substance in some way...
I scanned ten pages into OmniPage Pro (all the same one page as before
to keep things equal), and the PDF file was 145KB, or 15KB per page.
The single-page file was 78KB, which was misleading.
 
Wayne Fulton

Sorry, the 15KB per page was the wrong answer, not real world. I was
worrying about the compression directory, and thought I should do it
differently. So I scanned ten similar but different pages (from the same
document), and got 301KB, or about 30KB per PDF page (via OmniPage Pro).
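
The per-page figures from these tests can be laid side by side with a few lines of Python. All the numbers are the ones reported in this thread; the helper name is made up for illustration.

```python
def kb_per_page(total_kb, pages):
    """Average PDF size per page in KB."""
    return total_kb / pages

single_page = kb_per_page(78, 1)        # one page via OmniPage Pro
identical_ten = kb_per_page(145, 10)    # the same page scanned ten times
different_ten = kb_per_page(301, 10)    # ten different pages
reported_image_pdf = 750                # KB per page in the original scans

print(f"single page:         {single_page:.1f} KB")
print(f"ten identical pages: {identical_ten:.1f} KB per page")
print(f"ten different pages: {different_ten:.1f} KB per page")
print(f"750 KB is {reported_image_pdf / different_ten:.1f}x the ten-page average")
```

On these numbers the 750 KB pages are about 25 times larger than the OmniPage result for ten different pages.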
 
Larry

Wayne,

I handed the job back to the person for whom I was doing it, as I
couldn't load the Omnipage program onto the computer I was using for
reasons I don't need to go into here.

Using Omnipage Pro 14, he loaded the pdfs, saved them as graphical pdfs,
and the results were less than 10 percent the size of the originals. He
turned 10 pdf files totalling 545 MB into a bunch of pdfs totalling 50
MB.

He didn't do OCR on the files to get them into text because that
required corrections and the job was too big for that.

He's now placed the resulting files in a zip file at his web site where
people can open them or download them.

Thanks for all your great help.

Larry
 
Larry

Wayne, for your information, here's the e-mail from the person who
completed the job:

--------------

I tried to OCR them to produce text .pdf files, but the text-correction
process was too time-consuming, plus OmniPage has a nasty habit of
gratuitously mixing typefaces in the resulting output.

So I defaulted to 300-dpi graphical .pdf's, which are fine. The total
size is about 50MB, which is big, but not unmanageably so, especially
when broken down into individual files.

Searchability would have been nice, but we can live without it. What I
really need to know is how to stop OmniPage's nonsense with mixing
typefaces.

----------------
 
