Copying Text from a PDF file???

casey.o

I got a PDF file for some personal business I am dealing with. It's all
text (no pictures). I need to respond to portions of this file by
email, and want to copy portions of the text to the email, so I don't
have to type all of the stuff by hand. (It's 17 pages long). I have
been trying for 3 days to figure out how to copy the text in the PDF to
my email or to Notepad.

Ideally, I'd like to copy the entire PDF to a text file, so I can apply
my comments using Wordpad or even Notepad, and then email the final
results. Out of those 17 pages, there are at least 10 pages that need
to be commented on.....

I've tried Foxit Reader, PDF-Exchange, Sumatra, and Adobe Reader
(Acrobat). I began doing this in Win98, but copied the PDF to my XP
machine so I could install a newer version of Adobe Reader. I've tried
damn near everything. All I can get is either a blank .txt file, or, on
one occasion, I got the FIRST PAGE copied but could not get any further
ones copied. I CAN copy them to a picture file (such as .JPG), but that
won't work.

I read some stuff about this on the web, and it appears that some PDF
files are "locked" to prevent modifying them or extracting text from
them. I'm beginning to think this one may be locked. (except that I
did copy that first page????).

Can anyone give me some help with this, please!!!

Thanks

BTW: PDF-Exchange has an option to "export to text", but it's only in
the registered (paid) version. I have the free one.
 
Good Guy

Open the file in a PDF reader and then you can copy the entire file text
by going to:

Edit >> Select all

Now to copy this,

CTRL + C or

Edit >> Copy

To paste in Notepad, or email, simply paste it by doing:

CTRL + V

or

Edit >> Paste

You can now delete the sections that are not wanted.

Good luck.
 
Paul

In Acrobat Reader, look for Document Properties : Security.

Copy Content or Extraction : Allowed

That's what I see in an unprotected document. I am allowed
to copy text out of a document.
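
For the curious, that "Allowed" line in the dialog comes down to a single bit in the document's permission flags. A minimal sketch, assuming the standard PDF security handler, where the /P entry is a signed 32-bit integer and bit 5 (value 16) governs copying or extracting text. The function name and the two sample values are made up for illustration:

```python
# Bit positions follow the standard PDF security handler (1-based in the spec).
PRINT_BIT = 1 << 2      # bit 3: print the document
MODIFY_BIT = 1 << 3     # bit 4: modify contents
EXTRACT_BIT = 1 << 4    # bit 5: copy or extract text and graphics


def decode_permissions(p_value):
    """Decode the /P integer of an encrypted PDF into a few yes/no flags.

    p_value is usually negative because the unused high bits are set.
    Python's & treats negative ints as two's complement, so this works as-is.
    """
    return {
        "print": bool(p_value & PRINT_BIT),
        "modify": bool(p_value & MODIFY_BIT),
        "extract": bool(p_value & EXTRACT_BIT),
    }


# Hypothetical sample values, not taken from any real file.
print(decode_permissions(-44))    # extract flag is True
print(decode_permissions(-3904))  # extract flag is False
```

A viewer that honours the flags simply refuses to copy when the extract bit is clear. The text itself is still in the file.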

On top of that protection, there are a number of other ways
that an author can turn your copy attempt into a word jumble.
(It's possible to make the copy/paste buffer have different
contents than you see on the screen.) So even when you find
the patch that disables the security bit, it doesn't mean
you are home free. A method like this could be used on top
of it all.

http://spivey.oriel.ox.ac.uk/corner/Obfuscated_PDF

It took me two weeks, the writing of a bunch of scripts,
plus using a font editor, but I have defeated that one.
I stuck with it to prove a point. Part of the process
amounts to "human OCR", in the sense that to undo the
obfuscation, you have to rearrange entries in a font table,
and you do that process by eye (cannot be automated, without
invoking OCR at some level). You're better off just
OCRing the entire document.
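
The basic trick, and its undoing, can be pictured as a substitution table. This is only a toy sketch: the scrambled string and the mapping are invented for illustration, and in a real obfuscated PDF the table has to be worked out by eye (or by OCR) from the glyph shapes, as described above.

```python
# Toy model: the font draws the right shapes on screen, but the character
# codes stored in the file (and copied to the clipboard) are shuffled.
# Hypothetical mapping from copied character -> character actually displayed.
glyph_fix = {"q": "t", "x": "h", "z": "e"}


def unscramble(copied_text, mapping):
    """Apply a copied-char -> displayed-char table; other chars pass through."""
    return copied_text.translate(str.maketrans(mapping))


print(unscramble("qxz cat", glyph_fix))  # prints "the cat"
```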

There is a gentlemen's agreement amongst PDF engine writers,
to not defeat the security features. I don't know if the
current DMCA legislation would back that up with enforcement
of any sort or not. For at least one package, the author said
"I don't approve of breaking security", but someone looked
at the source, and figured out what part of it to comment out,
so that copy protection would be ignored. If you look at the
source code, it almost looks like the author clumped the necessary
stuff all in one small area, to make it easier to patch out. So
I have to wonder if the author was being on the level about it.
That package may have been the Linux version of XPDF. (The Windows
version is just some conversion programs, and you might need
MinGW to build such a thing. The viewer application uses XWindows,
and Windows does not normally deal with XWindows protocol itself.
Which is why the pre-compiled package for Windows, is missing the
most important part.)

I think at one time, I had a version of Ghostscript with the
necessary check commented out. That stopped working when the
later versions of PDF started adding more varieties of
security. At that time, there wasn't any password that
needed to be bypassed.

There is at least one software product, which smashes all the
security. It uses brute forcing for the highest security level
in PDF, so that takes some time. The less substantial security,
it busts that in seconds. But the software costs money, and
that's why I don't have a copy. If you Google on "PDF Password Removal"
you can find other examples. With the password smashed, perhaps
one of your "editing" tools will allow changing the document
properties or something.

http://www.elcomsoft.com/apdfpr.html#chart

*******

One technique to encourage WYSIWYG is to use OCR.

If you print the PDF to file, and make images out of it,
it can be fed to an OCR program. Some of the old OCR programs
are a pain in the ass, in that they require image files with
a resolution of 200 DPI to 400 DPI. It's relatively hard when
you're sitting in the ole computer chair, to figure out
a process to make the document fall into that DPI range. The
OCR program won't tell you what DPI it thought you had
used, so you can't just make the exact adjustment and get
it on the second try. It usually takes me a bunch of tries
until I nail it. Later OCR programs may be a bit more flexible
about the resolution (accept a really high resolution, and
not complain about it). With OCR, you will have a lot of
errors to correct.
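
The arithmetic behind the DPI guessing game is just pixels divided by inches. A small sketch, assuming a US Letter page (8.5 x 11 inches). The function names are made up for illustration, and the 200 to 400 DPI window is the one mentioned above:

```python
def pixels_for_page(width_in, height_in, dpi):
    """Pixel dimensions needed to render a page at a given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(pixel_width, page_width_in):
    """DPI implied by an image that spans the full page width."""
    return pixel_width / page_width_in


def in_ocr_range(dpi, low=200, high=400):
    """True if the DPI falls inside the window the OCR program accepts."""
    return low <= dpi <= high


print(pixels_for_page(8.5, 11, 300))            # (2550, 3300)
print(effective_dpi(1275, 8.5))                 # 150.0, too coarse
print(in_ocr_range(effective_dpi(2550, 8.5)))   # True
```

So if the image that comes out of the print-to-file step is only 1275 pixels wide, it needs to be rendered at twice the size before the OCR program will be happy with it.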

The full Adobe Distiller has a built-in OCR, but obviously
they don't allow you to make the mental jump, of defeating
copy protection by just doing OCR to the entire document. So
that option would require some "lather, rinse, repeat" type
cycles :) The built-in OCR only works, if the document
is a bitmap image captured into a PDF. It's almost like it
was intended to work with PDFs made by a scanner :) And
not for smashing copy protection.

Paul
 
Mayayana

One additional note: Sometimes people actually
scan in pages and put those into a PDF. In other
words, a PDF that's all text is sometimes actually
images of pages. In that case there is no text.
Each page must be extracted and OCR-ed.
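
A rough way to guess which kind of PDF you have is to look for image objects and font objects in the raw file. This is only a heuristic sketch: it sees uncompressed object dictionaries only (newer PDFs can tuck them into compressed streams), a scan with an OCR text layer will contain fonts, and the function name and sample fragments are made up for illustration.

```python
import re

IMAGE_RE = re.compile(rb"/Subtype\s*/Image")
FONT_RE = re.compile(rb"/Type\s*/Font(?![A-Za-z])")


def looks_like_scan(pdf_bytes):
    """Rough guess: images present but no font objects suggests scanned pages.

    Treat the answer as a hint, not a verdict.
    """
    has_images = IMAGE_RE.search(pdf_bytes) is not None
    has_fonts = FONT_RE.search(pdf_bytes) is not None
    return has_images and not has_fonts


# Tiny hand-made fragments standing in for real files (not valid PDFs).
scan_like = b"1 0 obj << /Type /XObject /Subtype /Image /Width 1700 >> endobj"
text_like = b"2 0 obj << /Type /Font /Subtype /Type1 /BaseFont /Helvetica >> endobj"

print(looks_like_scan(scan_like))   # True
print(looks_like_scan(text_like))   # False
```

To try it on a real file, read the file in binary mode and pass the bytes in.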
 
casey.o

This particular PDF file, *IS* a scan. I think that explains a lot
about it. The filename even has the word "scan" in it, and it contains
a few underlines that were done with a pen. With that said, I'm at a
loss how to do the OCR thing. I do recall back in the Win3.x days, I
had a page scanner that included OCR software, so I'm familiar with that
sort of thing, but that scanner and the software are all history, and
might not work on XP or Win98 anyhow. And probably would not work
properly without the scanner either.

I'd appreciate some help in this matter, such as where to obtain the OCR
software (free if possible), and what steps are needed to do the
conversion. All efforts thus far at converting it, have resulted in
either NO output, or a minimal amount of jumbled characters. The first
page, which was semi-converted, contains all numerical characters, as in
an amortization schedule. That was the only part that could be partially
converted to text, and that was scrambled, but still readable.

Thanks
 
Paul

The hardest part of this, will be converting from PDF to an image.
I do this thing all the time, but the setup the first time isn't
all that easy. Maybe someone else will volunteer a solution...
One that doesn't suck :)

I found a server the other day, this one. This uses GNU Ocrad.
It says right on the site, to not upload "confidential" documents.
So the purpose of this, is so you can see just how miserable
a free OCR can be :) When I tested this, it was making the
easy errors that the old OCRs I paid good money for were making,
say, about ten years ago.

http://ocr1.sc.isc.tohoku.ac.jp/e1/

When I convert PDF to BMP or the like, I use GIMP.
You run this on your WinXP machine.

http://www.gimp.org/downloads/

Next, you'll need a copy of Ghostscript.

http://www.ghostscript.com/download/gsdnld.html

An example of the install location, is

C:\Program Files\gs\gs9.10\bin

and that download page says the current version is 9.14.

The tricky part, is going back to GIMP, and telling it
the path of the executable for Ghostscript.

Now, if someone else can tell you where to get
a "direct" converter, that would be a lot better
than installing two things, and trying to get one
to "call" the other. Once you get it running though,
it does a good job. Gone are the days of the
fonts being a total disaster. The fonts
no longer bother me :) And part of that,
is the practice of embedding fonts when
a PDF is produced. That's probably helped as
much as anything.
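
For what it's worth, Ghostscript on its own can render a PDF straight to image files from the command line, without going through GIMP. A minimal sketch that builds such a command, assuming the console executable is called gswin32c (the usual name on 32-bit Windows) and is on the PATH. The file names, the 300 DPI default and the function names are placeholders for illustration:

```python
import subprocess


def ghostscript_command(pdf_path, out_pattern="page-%03d.png", dpi=300,
                        gs_exe="gswin32c"):
    """Build a Ghostscript command that renders each PDF page to a PNG."""
    return [
        gs_exe,
        "-dNOPAUSE",                            # don't pause between pages
        "-dBATCH",                              # exit after the last page
        "-sDEVICE=png16m",                      # 24-bit colour PNG output
        "-r{}".format(dpi),                     # rendering resolution in DPI
        "-sOutputFile={}".format(out_pattern),  # %03d becomes the page number
        pdf_path,
    ]


def convert(pdf_path, **kwargs):
    """Run the conversion; raises CalledProcessError if Ghostscript fails."""
    subprocess.check_call(ghostscript_command(pdf_path, **kwargs))


# Show the command line without running anything.
print(" ".join(ghostscript_command("scan.pdf")))
```

The printed line can also be typed directly at a command prompt in the Ghostscript bin folder. The resulting page images are what you would then feed to the OCR program.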

Paul
 
Mayayana

In addition to Paul's info, you might look into
XPDF. It's a simple program that hasn't been
kept updated, but it can extract images from
a PDF.

I have a copy of Textbridge Pro 8 that came
free on a printer CD. It works surprisingly well, so I've
never needed anything else. Unfortunately, they
don't seem to give away good software on install
CDs anymore.
 
J. P. Gilliver (John)

[QUOTE="Paul"]
The hardest part of this, will be converting from PDF to an image.
I do this thing all the time, but the setup the first time isn't
all that easy. Maybe someone else will volunteer a solution...
One that doesn't suck :)
[/QUOTE]

I have "PDF Image Extraction Wizard" - "Only the first three images will
be extracted" for the free version I have, which is dated 2010 and
mentions www.rlvision.com; there are plenty of others though. But some
OCR software - I think Omnipage is one - will take a PDF as input.
(Because, I presume, a lot of scanner drivers seem to produce PDF as the
default, which has always struck me as an odd choice, especially for
single sheets.)

[QUOTE="Paul"]
I found a server the other day, this one. This uses GNU Ocrad.
It says right on the site, to not upload "confidential" documents.
So the purpose of this, is so you can see just how miserable
a free OCR can be :) When I tested this, it was making the
easy errors that the old OCRs I paid good money for were making,
say, about ten years ago.
[/QUOTE]

There are a few free ones about, and a lot more are free-with-a-scanner
(including all-in-one "printers" these days); if the one that worked
with your old scanner won't run under your present Windows, ask friends
and neighbours if they have one you could borrow, if you can't get on
with the free ones you can find online. I've been quite favourably
impressed with one called something like ABBYY that is common with
scanners (and _may_ be downloadable, if only by pretending you have a
scanner).

Most such come-with-scanner OCRs _will_ work without the scanner, from
an image (might have to be in a different format, such as [PDF or] TIFF
rather than GIF/JPG, but that's easily converted, e. g. in IrfanView);
they work on the basis you might want to OCR an image you created
earlier.
--
J. P. Gilliver. UMRA: 1960/<1985 MB++G()AL-IS-Ch++(p)Ar@T+H+Sh0!:`)DNAf

.... much to the surprise of everyone else in the galaxy, who had not realised
that the best way not to be unhappy is not to have a word for it. (Link
episode)
 
