Jeff said:
Not sure why you posted twice ...
A PDF file is an image. To get the "data" out of it, you'd need to convert
it to something other than an image.
If you have a "pro" version of something like Adobe, you can try the OCR
(optical character recognition) feature to re-build the underlying data ...
but be aware that OCR is less than 100% accurate. Plan on having some of
your data 'lost in translation'.
Regards
Jeff Boyce
Microsoft Office/Access MVP
A PDF file can contain images, but to claim that "a PDF file is an
image" seems shockingly simplistic, IMO, unless you are only considering
the output to your screen. The format is far richer than that: the
PDF 1.7 Reference describing it runs to about 1310 pages. See the
discussion in the following thread:
http://groups.google.com/group/microsoft.public.access/browse_frm/thread/d34aa27e14854f45
Basically, extracting text and images from a PDF file with 100% accuracy
ranges from fairly easy to very difficult, depending on things like the
scope and method of compression used, the number of edits made, and
whether the program that created the PDF file applied PDF Linearization
optimization. For anything past "fairly easy" I recommend not using
Access to perform the extraction from the data streams, even though
Access theoretically has enough capability to perform the task. I agree
that image and text data can be extracted from a screen capture (or try
a simple copy/paste for text data), but I consider those methods,
especially the "lossy" OCR, to be last resorts.
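To illustrate why compression matters, here is a rough sketch, in Python
rather than Access, that scans the raw bytes of a PDF file for streams
marked /FlateDecode and inflates them with zlib. It also checks for the
/Linearized key near the start of the file. The names inflate_streams
and is_linearized are my own, chosen for illustration, and the regular
expression is a deliberate simplification: it assumes flat stream
dictionaries and ignores encryption, object streams, and every filter
other than FlateDecode. Treat it as a quick probe, not a real parser.

```python
import re
import zlib

# Simplified pattern for "<< dictionary >> stream ... endstream".
# Assumption: the stream dictionary contains no nested << >> or hex strings.
STREAM_RE = re.compile(rb"<<([^<>]*)>>\s*stream\r?\n(.*?)endstream", re.DOTALL)


def is_linearized(pdf_bytes):
    """Return True if a /Linearized key appears in the first 1024 bytes."""
    return b"/Linearized" in pdf_bytes[:1024]


def inflate_streams(pdf_bytes):
    """Return the decompressed bodies of all FlateDecode streams found."""
    results = []
    for dictionary, body in STREAM_RE.findall(pdf_bytes):
        if b"/FlateDecode" not in dictionary:
            continue  # uncompressed, or a filter this sketch does not handle
        try:
            # decompressobj tolerates the end-of-line bytes left after the data
            results.append(zlib.decompressobj().decompress(body))
        except zlib.error:
            pass  # encrypted, chained filters, or a false regex match
    return results


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        data = handle.read()
    print("linearized:", is_linearized(data))
    for index, content in enumerate(inflate_streams(data)):
        print(index, content[:80])
```

If the inflated streams show readable text, the file is probably toward
the "fairly easy" end. If they show little or nothing useful, a
dedicated tool is likely the better route.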
I think I remember seeing a free software tool that can split a PDF
file into individual one-page PDF files. Googling... Perhaps it was:
http://www.pdfhacks.com/pdftk/
Using something like that could break a complex problem down into
smaller pieces that may be more amenable to data extraction. If all
else fails, there are likely many commercial software packages that can
extract data from PDF files and that cost under $100.00.
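For the page-splitting idea, pdftk has a "burst" operation that writes
one PDF file per page. A minimal sketch of driving it from Python
follows. It assumes pdftk is installed and on the PATH. The helper
names build_burst_command and burst_pdf, and the file name in the usage
line, are hypothetical.

```python
import shutil
import subprocess


def build_burst_command(pdf_path, pattern="page_%02d.pdf"):
    """Build the pdftk argument list that splits pdf_path into single pages."""
    # "output" takes a printf-style pattern that names each page file.
    return ["pdftk", pdf_path, "burst", "output", pattern]


def burst_pdf(pdf_path, pattern="page_%02d.pdf"):
    """Run pdftk burst; raises if pdftk is missing or exits with an error."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk was not found on the PATH")
    subprocess.run(build_burst_command(pdf_path, pattern), check=True)
```

Calling burst_pdf("report.pdf") would then leave page_01.pdf,
page_02.pdf, and so on in the current directory, each small enough to
inspect on its own.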
James A. Fortune