OT PDF converter

Flash

A little off topic, but I'm trying to find a decent and free PDF to Word
converter for 2-4 documents. Any suggestions are welcome. TIA
 
philo 

Just cut and paste your pdf document into Word, that's all there is to it
 
Paul

philo said:
Just cut and paste your pdf document into Word, that's all there is to it

That's not the half of it.

1) PDFs exist which were "made on a FAX machine".
You can't wipe your mouse over the text.
Only OCR works with those. There are various
ways to get yourself a copy of an OCR program.

Each page in the document is a pixmap.
There are no text strings. Only images, one per page.

2) There is a special obfuscation technique that takes
advantage of the double indirection of PDF text
representations. It allows the on-screen view to
have perfectly good, readable letters. But if you wipe
over the text, the copy/paste buffer has "garbage" in it.

I actually "fixed" a document like that, when someone presented
such a document as an example. It took me two weeks, a number
of scripts, but I eventually returned the mapping to "linear".
Such that, if you wipe over the text, the copy/paste buffer
has the real goods in it.
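
To make the "double indirection" idea concrete, here is a toy sketch in Python. It is not a PDF parser and it is not the scripts mentioned above; it only models the principle, assuming a font whose character codes have been shuffled. The names make_scrambled_font, render, naive_copy and relinearize are invented for illustration.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits + " .,"

def make_scrambled_font(seed=0):
    """Build a toy 'font': each character gets an arbitrary small code.

    Hypothetical stand-in for an obfuscated font encoding; real PDFs
    are more involved than a single lookup table.
    """
    codes = list(range(1, len(ALPHABET) + 1))
    random.Random(seed).shuffle(codes)
    char_to_code = dict(zip(ALPHABET, codes))
    code_to_glyph = {code: ch for ch, code in char_to_code.items()}
    return char_to_code, code_to_glyph

def render(stored, code_to_glyph):
    """What the viewer draws: code -> glyph, so the page looks fine."""
    return "".join(code_to_glyph[c] for c in stored)

def naive_copy(stored):
    """What copy/paste yields if the codes are taken at face value."""
    return "".join(chr(c) for c in stored)

def relinearize(stored, code_to_glyph):
    """Re-encode so each code equals the character's ordinary code point."""
    return [ord(code_to_glyph[c]) for c in stored]

char_to_code, code_to_glyph = make_scrambled_font()
text = "hello world"
stored = [char_to_code[ch] for ch in text]   # what the file holds

on_screen = render(stored, code_to_glyph)    # readable on the page
pasted = naive_copy(stored)                  # garbage in the paste buffer
fixed = naive_copy(relinearize(stored, code_to_glyph))  # readable again
```

The hard part in a real document is recovering the code-to-glyph table in the first place; the sketch simply assumes it is known.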

Certain third-party word processing tools possess that
capability (controlled obfuscation).

Details of the method are here. This is the idiot that started it.

http://spivey.oriel.ox.ac.uk/corner/Obfuscated_PDF

Such damage could not (easily) be recovered by a PDF to DOCX
tool. That's because each word processing developer uses
slightly different variations on a common theme.

So while it's fun to pretend that someone is going to write
the "perfect PDF converter", you could be in for a rude
shock if you test it for long enough. There's some stuff
that is *really hard* to copy. And that's because the
industry no longer relies on those nifty "security" features
for total protection. The obfuscation method means it's a whole
new ballgame.

Paul
 
JJ

IOW, a PDF is a digitally printed document.
The document data is more like a PostScript /script/.
The same goes for XPS. And DjVu (I think), although it's not based on PostScript.

For unprotected/unobfuscated documents that aren't pre-rendered, text font
styles (e.g. bold, italic, etc.) can still be recognized and converted.

Generally, the page content layout is difficult to convert, e.g. multi-column
pages, insets, etc. The converter must have a layout recognition
engine for that (which I doubt).
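
A small Python sketch of why columns cause trouble, under simplified assumptions: the page is reduced to text fragments with (x, y) positions, and a "column" is guessed from a gap between left edges. The function names and the gap value are hypothetical, and this is nowhere near a real layout engine.

```python
from typing import List, Tuple

Fragment = Tuple[float, float, str]  # (x, y, text); y grows down the page

def naive_order(frags: List[Fragment]) -> List[str]:
    """Read strictly top-to-bottom, left-to-right, ignoring columns."""
    return [t for _, _, t in sorted(frags, key=lambda f: (f[1], f[0]))]

def column_aware_order(frags: List[Fragment], gap: float = 100.0) -> List[str]:
    """Group fragments into columns by left edge, then read each column."""
    if not frags:
        return []
    xs = sorted({f[0] for f in frags})
    columns = [[xs[0]]]
    for x in xs[1:]:
        if x - columns[-1][-1] > gap:   # big jump in x: assume a new column
            columns.append([x])
        else:
            columns[-1].append(x)
    out: List[str] = []
    for col in columns:
        members = [f for f in frags if f[0] in col]
        out.extend(t for _, _, t in sorted(members, key=lambda f: (f[1], f[0])))
    return out

page = [
    (50.0, 100.0, "Left line 1"),
    (350.0, 100.0, "Right line 1"),
    (50.0, 120.0, "Left line 2"),
    (350.0, 120.0, "Right line 2"),
]

interleaved = naive_order(page)      # mixes the two columns line by line
by_column = column_aware_order(page) # left column first, then right
```

Even this toy version needs a tuning knob (the gap), which hints at why insets, sidebars and uneven columns are hard to get right.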

Tables are also difficult to convert, since there isn't any standard
/optical table recognition/ engine that I know of. Converting an
Excel-converted PDF back to Excel is a pain.

I'd be surprised if there is one.
 
philo 

That's not the half of it.

1) PDFs exist which were "made on a FAX machine".
You can't wipe your mouse over the text.
Only OCR works with those. There are various
ways to get yourself a copy of an OCR program.

Each page in the document is a pixmap.

That has nothing to do with the OP's question
 
J. P. Gilliver (John)

JJ said:
That's not the half of it. []
2) There is a special obfuscation technique that takes
[]
IOTW, PDF is a digital printed document.
[]
_some_ of such documents might be accessible - with loss of formatting -
by using a "plain text" printer, set to print to file. But I suspect not
all! (If only because formatting - such as columns - will screw things
up.)
--
J. P. Gilliver. UMRA: 1960/<1985 MB++G()AL-IS-Ch++(p)Ar@T+H+Sh0!:`)DNAf

Anybody who thinks there can be unlimited growth in a static, limited
environment, is either mad or an economist. - Sir David Attenborough, in
Radio Times 10-16 November 2012
 
J. P. Gilliver (John)

Paul said:
That has nothing to do with the OP's question

"decent and free PDF to word converter"

OK, this is a PDF. Show me your copy and paste trick.

http://cds.linear.com/docs/en/datasheet/lt0412a.pdf

Paul

I shared your reaction to philo's rude reply!

However, we don't (well, I don't) know what philo is using to view PDFs.
I can conceive (though I don't know) that there might be PDF viewers
that have OCR built in. I'm pretty sure I have seen OCR software that
does accept PDF as input, and some of it may offer the illusion of cut
and paste - but "just" wouldn't be the right word to use there (-:! (I
wouldn't refer to an OCR program - even if it accepted PDF - as a PDF viewer.)

I'll be interested to see his reply, if any, though!
 
philo 

I shared your reaction to philo's rude reply!


I found Paul's reply to be rude actually...and yours as well.
Paul normally gives excellent advice but the OP specifically said he
wanted to convert a PDF document to Word.

He said /nothing/ about it being a scanned fax.
Since fax machines are rarely used anymore, it did not seem reasonable
to make the assumption that that's what the OP had in mind.
For that, obviously one would need OCR software.
 
Paul

I've been using an application called PDF to Word Doc Converter. Here's a link to their site: http://www.hellopdf.com/

Try converting this doc.

This is doubly protected.

http://www.2muslims.com/books/2discoverislam_com_riyad_us_saliheen.pdf

*******

Copying is disabled to start. So copy and paste will not work.
That can be fixed, by using a patched version of XPDF, to disable
the copy protection bit. (I'm sure there are a ton of other
ways to do that, but it's the first one I got working at the time.)

The second problem is the obfuscated font technique.
If you wipe over some text, and paste into Notepad or Wordpad,
you get "lots of squares" and those are values below normal
printable ASCII (as they're actually indexes into a table).
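
As a rough illustration of spotting that symptom, here is a Python sketch that flags pasted text made up mostly of control characters. The function name and the 0.3 threshold are assumptions chosen for the example, not something any converter is known to use.

```python
def looks_obfuscated(pasted: str, threshold: float = 0.3) -> bool:
    """Guess whether pasted text is table indexes rather than real characters.

    Counts characters below the printable ASCII range (code < 32),
    ignoring ordinary whitespace such as tab and newline.
    The threshold is an arbitrary, hypothetical cut-off.
    """
    if not pasted:
        return False
    ordinary = {"\t", "\n", "\r"}
    control = sum(1 for ch in pasted if ord(ch) < 32 and ch not in ordinary)
    return control / len(pasted) >= threshold

garbage = "\x01\x05\x12\x03\x0e"       # the "lots of squares" case
normal = "Plain readable text.\n"

garbage_flagged = looks_obfuscated(garbage)
normal_flagged = looks_obfuscated(normal)
```

A tool could use a check like this to decide when to fall back to OCR, though whether any of the free converters actually do so is exactly the open question here.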

I'd be interested what these various free tools make of that test doc.
And whether they can do anything with it.

When someone brought this type of document to my attention some months ago,
my initial impression was that only some form of OCR technique
would make it copyable as text (copy/paste). That's what I suggested,
as the person posting the question was in a hurry. I'm curious
whether any of these free converters just "give up" on straightforward
conversion and simply OCR everything...

Paul
 
Paul

OK, I tried http://www.convertfiles.com/ and
submitted an obfuscated single page as a test,
and it converted it! Results are excellent.
I had to use my file with the copy protection
step removed, in order to extract just page 128
for conversion. The format appears to be RTF (Rich Text Format),
when examined with a hex editor.

http://dw4.convertfiles.com/files/0187627001375532272/heenp128.doc

The conversion is a little slow, but I suspect
they wanted me to look at the adverts :)

Paul
 
J. P. Gilliver (John)

philo said:
I found Paul's reply to be rude actually...and yours as well.

Paul is rarely rude - less so than I, I think.

Paul normally gives excellent advice but the OP specifically said he
wanted to convert a PDF document to Word.

He said /nothing/ about it being a scanned fax.
Since fax machines are rarely used anymore, it did not seem reasonable
to make the assumption that that's what the OP had in mind.
For that, obviously one would need OCR software.

OK, Paul was a little lazy in using "fax" to mean "scanner". I very much
doubt the example he quoted above (data sheet from Linear Technology)
was actually made on a fax machine.

The fact remains that a fair proportion - I'd say 5 to 10 per cent, but
it will vary a lot by context - of the PDFs around _are_ just scanned
images (sometimes in colour, so obviously _not_ done on a fax machine).
Your "just cut and paste" [you meant copy and paste (-:] was probably
meant to be helpful; Paul's tendency to give complete replies meant he
couldn't let the OP think that that would work in all circumstances.

"Flash", if you're still with us: have you managed to get your "2-4
documents" into Word? If so, what did you use? (What were/are they?)
--
J. P. Gilliver. UMRA: 1960/<1985 MB++G()AL-IS-Ch++(p)Ar@T+H+Sh0!:`)DNAf

"Anything else you'd like me to do while I'm at it? Paint the sky green? Bury
the odd elephant I find lying around ..." - Tidy, the Android - Earthsearch II,
part 2. (1982-5-2)
 
philo 

In short, when someone posts a question, unless they state otherwise...
I just take it in its most basic terms. I think Paul made an
assumption that just plain was not there.

The OP simply said "PDF" and nothing more.

At any rate even a scanned image can be cut and pasted into Word...
it will not be editable text of course.


Anyway I'm sure the OP has well been scared off of Usenet by now
 
J. P. Gilliver (John)

In message <[email protected]>,
I've been using an application called PDF to Word Doc Converter. Here's
a link to their site: http://www.hellopdf.com/

Hmm. I suppose I can't complain as it's free, but:

I tried it with Paul's datasheet, expecting it to produce just images.
It rapidly shot up to 99%, then sat there until I gave up.

I tried it with a more conventional PDF - a UK act of parliament (Mobile
Homes Act 2013, chapter 14 - 36 pages, all text other than the royal
crest on a few pages and the barcode on the last page). First, I tried
with "Uses Text-Box" unticked, as that generally produces something not
much more editable than the original PDF, though with all the text in
the right place. This produced 2 pages - one with the crest and the
title, the other blank. So I tried with Uses Text-box ticked. This -
quite quickly - produced 36 pages, laid out as the original. (It had
removed some horizontal lines, and chosen much taller pages than the
original, but I could live with that.) Each line of text in the original
(not each sentence: each line) was in a separate, positioned, box. An
original 535 KB .pdf file generated a 24,662 KB .doc file.

While I've been typing this, I let it have another go at Paul's
datasheet; it eventually finished, producing, as I expected, a Word
document containing four images. This time, the original 487 KB .pdf has
created a 271,952 KB .doc!!! No wonder it took so long!

Again, I feel churlish for criticising it because it's free, but I'm not
sure what it does for me that philo's cut (really copy) and paste from
the PDF doesn't.
 
J. P. Gilliver (John)

Original question: Little off topic but trying to find a decent and free
PDF to word converter for 2-4 documents. Any suggestions are welcomed,
TIA


philo said:
On 08/03/2013 08:28 AM, J. P. Gilliver (John) wrote: []
In short, when someone posts a question, unless they state otherwise...
I just take it in its most basic terms. I think Paul made an
assumption that just plain was not there.

The OP simply said "PDF" and nothing more.

Well, you made one about the nature of the .pdf files! Quite possibly a
valid one. But not everybody - possibly including the OP - _realises_
that some PDFs are just scanned images.

At any rate even a scanned image can be cut and pasted into Word...
it will not be editable text of course.

True!

Anyway I'm sure the OP has well been scared off of Usenet by now

Well, I hope not Usenet, but I'd be surprised if he's still reading this
thread (-:!
 
philo 

Original question: Little off topic but trying to find a decent and free
PDF to word converter for 2-4 documents. Any suggestions are welcomed, TIA


On 08/03/2013 08:28 AM, J. P. Gilliver (John) wrote: []
In short, when someone posts a question, unless they state otherwise...
I just take it in its most basic terms. I think Paul made an
assumption that just plain was not there.

The OP simply said "PDF" and nothing more.

Well, you made one about the nature of the .pdf files! Quite possibly a
valid one. But not everybody - possibly including the OP - _realises_
that some PDFs are just scanned images.



<snip>


yep I made an assumption

OTOH: if someone asks me what's one plus one...

I am going to make the assumption they are using base 10
 
J. P. Gilliver (John)

philo said:
yep I made an assumption

OTOH: if someone asks me what's one plus one...

I am going to make the assumption they are using base 10

What base is your "10" in?

[Don't reply - just joshing (-:!]
 
