Extract Image From PDF


Steve

Hi all

Does anybody know a way to extract an image from a PDF file and save
it as a TIFF?

I have used a scanner to scan documents which are then placed on a server,
but I need to extract the image of the document (just the first page if
there are multiple pages) and save it as a TIFF so I can then use the
Tesseract OCR to get the text in the image.

I think the company I am working for may have a license for Adobe Acrobat
Professional, if that provides a way to do this from my .NET application.

Thank you for your help.

Kind Regards,
Steve
 
Rick

I don't know of a .NET way exactly, but you can check out Ghostscript,
which will let you read a PDF and save it as a TIFF. I think you can
specify which pages to convert. You can call Ghostscript from the command
line with your parameters using Process.Start.

hth,

Rick
 
Rick

I run mine from .NET with Process.Start like this:

process.StartInfo.Arguments = String.Format(
    "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffg3 -sOutputFile=""{0}"" ""{1}""",
    Path.ChangeExtension(fileName, ".tif"), fileName)


If you want to read only the first page, add -dFirstPage=1
and -dLastPage=1 (see http://web.mit.edu/ghostscript/www/Use.htm ).
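
Putting those switches together, a minimal sketch of the whole call might look like the following. It assumes the Ghostscript console executable is named gswin32c.exe and is on the PATH (adjust for your install); the module and function names are made up for illustration, and the input path is just an example.

```vbnet
Imports System.Diagnostics
Imports System.IO

Module GhostscriptFirstPage
    ' Assumption: the Ghostscript console executable is on the PATH.
    ' On 64-bit installs the name is typically gswin64c.exe instead.
    Private Const GhostscriptExe As String = "gswin32c.exe"

    ' Builds the argument string that renders only page 1 to a TIFF
    ' written next to the source PDF.
    Function BuildFirstPageArgs(ByVal pdfPath As String) As String
        Dim tiffPath As String = Path.ChangeExtension(pdfPath, ".tif")
        Return String.Format(
            "-dSAFER -dBATCH -dNOPAUSE -dFirstPage=1 -dLastPage=1 " &
            "-sDEVICE=tiffg3 -sOutputFile=""{0}"" ""{1}""",
            tiffPath, pdfPath)
    End Function

    ' Runs Ghostscript, waits for it to finish and returns its exit code.
    Function ConvertFirstPage(ByVal pdfPath As String) As Integer
        Dim startInfo As New ProcessStartInfo(GhostscriptExe, BuildFirstPageArgs(pdfPath))
        startInfo.UseShellExecute = False
        startInfo.CreateNoWindow = True
        Using proc As Process = Process.Start(startInfo)
            proc.WaitForExit()
            Return proc.ExitCode
        End Using
    End Function

    Sub Main()
        ' Example path only; an exit code of 0 normally indicates success.
        Console.WriteLine(ConvertFirstPage("C:\MyPDFDoc.pdf"))
    End Sub
End Module
```

Waiting for the process to exit before touching the .tif avoids handing a half-written file to the OCR step.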

I am slightly confused about what you really want. I understood you want to
convert the entire first page to a TIFF file and then use an OCR program to
read the text. If you only want to extract an image from the first page, I'm
not sure this would work. I don't know of a facility to extract an image
from a PDF. You might check iTextSharp, which can create and read PDFs. If
you know the name of the image you may be able to extract it.

Also, if you want a Tiff file why are you extracting to a jpeg below?

Rick
 
Steve Amey

Hi Rick

Thanks for that. I have downloaded and installed Ghostscript. I have a demo
app that can execute Ghostscript with command line parameters, and at the
moment I can only get the revision number and a thumbnail view of the first
page (JPEG) based on the content I have found.

Do you know the parameters I would need to extract the image on the first
page to a TIFF please? I can't seem to find these anywhere :(

Here are the args I found to generate a jpeg based on a pdf document:

Dim astrArgs(6) As String 'Upper bound 6 gives exactly the 7 elements used below
astrArgs(0) = "pdf2jpg" 'The first parameter is ignored
astrArgs(1) = "-dNOPAUSE"
astrArgs(2) = "-dBATCH"
astrArgs(3) = "-dSAFER"
astrArgs(4) = "-sDEVICE=jpeg"
astrArgs(5) = "-sOutputFile=C:\Thumbnail.jpg"
astrArgs(6) = "C:\MyPDFDoc.pdf"

Thanks for your help!

Regards,
Steve
 
Steve Amey

Thank you, I'm generating tiff files now.

The pdf is an image of a scanned document. I would like to get the text of
the scanned image. I looked into OCR, and came across Tesseract. To my
knowledge, Tesseract can (only) read a tiff file and extract the text. If I
open up a document in Adobe Pro and save the scanned image as a tiff,
tesseract does read most of it quite well, but my problem is that I have to
automate the process and can't open up the documents and manually save the
images, so I need something to extract the scanned image in the pdf file and
save it as a tiff so tesseract can read it. I tried iTextSharp already but I
get an error "PDF header signature not found", which I'm guessing is a
problem with the way the scanner creates the pdf files and iTextSharp can't
open it.
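
That error usually suggests the reader could not find the "%PDF-" marker near the start of the file. A quick way to check what the scanner actually writes is sketched below. The 1024-byte window is an assumption about how much leading junk PDF readers commonly tolerate, and the module and function names are hypothetical.

```vbnet
Imports System.IO
Imports System.Text

Module PdfHeaderCheck
    ' Returns True if the "%PDF-" marker appears within the first
    ' 1024 bytes of the file (assumed tolerance, see the note above).
    Function HasPdfHeader(ByVal filePath As String) As Boolean
        Dim buffer(1023) As Byte
        Dim count As Integer
        Using stream As New FileStream(filePath, FileMode.Open, FileAccess.Read)
            ' Read may return fewer bytes than requested for short files.
            count = stream.Read(buffer, 0, buffer.Length)
        End Using
        Dim head As String = Encoding.ASCII.GetString(buffer, 0, count)
        Return head.Contains("%PDF-")
    End Function

    Sub Main()
        ' Example path only.
        Console.WriteLine(HasPdfHeader("C:\MyPDFDoc.pdf"))
    End Sub
End Module
```

If this returns False for the scanner's files, the problem is in the file itself (or in how it was copied to the server) rather than in iTextSharp.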

I found some sample code that creates a jpeg, which is what I posted, but I
didn't know how to create a tiff file, but I see that it's just a case of
changing the -sDEVICE parameter to the one you are using.

Unfortunately, the resulting tiff image is not great quality and tesseract
makes many errors when trying to read it, so I have to find another way or
give up :(
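
One thing that may be worth trying before giving up: the output resolution and device both affect OCR quality. Ghostscript's -r switch sets the rendering resolution in dpi, and devices such as tiffgray (8-bit grayscale) or tiffg4 (1-bit, Group 4 compression) are alternatives to tiffg3. The sketch below only builds the argument string; the 300 dpi figure and the choice of tiffgray are starting points to experiment with, not known-good settings for these scans, and the module and function names are hypothetical.

```vbnet
Imports System.IO

Module GhostscriptOcrArgs
    ' Builds Ghostscript arguments for a first-page grayscale TIFF
    ' rendered at the requested resolution (dpi).
    Function BuildOcrArgs(ByVal pdfPath As String, ByVal dpi As Integer) As String
        Dim tiffPath As String = Path.ChangeExtension(pdfPath, ".tif")
        Return String.Format(
            "-dSAFER -dBATCH -dNOPAUSE -dFirstPage=1 -dLastPage=1 " &
            "-r{0} -sDEVICE=tiffgray -sOutputFile=""{1}"" ""{2}""",
            dpi, tiffPath, pdfPath)
    End Function

    Sub Main()
        ' Example path only; pass the result to Process.Start as before.
        Console.WriteLine(BuildOcrArgs("C:\MyPDFDoc.pdf", 300))
    End Sub
End Module
```

Comparing Tesseract's output at a couple of resolutions on the same document should show fairly quickly whether the rendering settings are the bottleneck.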

Thank you for your help, if you know of any other way to do what I'm trying
then I'd love to know! I don't mind paying a small amount for some
commercial software that can extract images from pdf docs that I can use in
.NET, but I haven't found any yet that don't cost hundreds or even thousands
of dollars.
 