Flash
Little off topic but trying to find a decent and free PDF to word
converter for 2-4 documents. Any suggestions are welcomed, TIA
philo said: Just cut and paste your pdf document into Word, that's all there is to it
That's not the half of it.
1) PDFs exist which were "made on a FAX machine".
You can't wipe your mouse over the text.
Only OCR works with those. There are various
means to get yourself a copy of an OCR program.
Each page in the document is a pixmap.
There are no text strings, only images, one per page.
2) There is a special obfuscation technique that takes
advantage of the double indirection of PDF text
representations. It allows the on-screen view to
have perfectly good, readable letters. But if you wipe
over the text, the copy/paste buffer has "garbage" in it.
I actually "fixed" a document like that, when someone presented
such a document as an example. It took me two weeks, a number
of scripts, but I eventually returned the mapping to "linear".
Such that, if you wipe over the text, the copy/paste buffer
has the real goods in it.
Certain third-party word processing tools possess that
capability (controlled obfuscation).
Details of the method, here. This is the idiot that started it.
http://spivey.oriel.ox.ac.uk/corner/Obfuscated_PDF
Such damage could not (easily) be recovered by a PDF to DOCX
tool. That's because each word processing developer uses
slightly different variations on a common theme.
So while it's fun to pretend that someone is going to write
the "perfect PDF converter", you could be in for a rude
shock if you test it for long enough. There's some stuff
that is *really hard* to copy. And that's because the
industry no longer relies on those nifty "security" features
for total protection. The obfuscation method means it's a whole
new ballgame.
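For anyone wondering what the "double indirection" in point 2 looks like, here is a toy Python sketch. It is not how any real PDF library or the scripts mentioned above work; the alphabet, the glyph names and every function name are made up for illustration. The idea: the page stores character codes, the font maps each code to a glyph shape (so the screen looks right), and copy/paste needs a separate code-to-Unicode table. Scramble the codes and leave that table out, and the clipboard gets garbage until someone rebuilds the table.

```python
import random

# Hypothetical alphabet for the toy model: 27 symbols, so codes fit in 1..27.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def glyph_name(ch):
    """Name of the glyph shape that draws a character (made-up naming)."""
    return "space" if ch == " " else ch


def make_obfuscated_font(alphabet=ALPHABET, seed=0):
    """Assign every character a shuffled low code, below printable ASCII."""
    rng = random.Random(seed)
    codes = list(range(1, len(alphabet) + 1))
    rng.shuffle(codes)
    char_to_code = dict(zip(alphabet, codes))
    code_to_glyph = {code: glyph_name(ch) for ch, code in char_to_code.items()}
    return char_to_code, code_to_glyph


def encode(text, char_to_code):
    """What ends up stored in the page: a string of scrambled codes."""
    return bytes(char_to_code[ch] for ch in text)


def on_screen(codes, code_to_glyph):
    """What the viewer draws: each code selects a glyph shape."""
    return "".join(" " if code_to_glyph[c] == "space" else code_to_glyph[c]
                   for c in codes)


def naive_copy(codes):
    """What copy/paste yields when the codes are treated as characters."""
    return "".join(chr(c) for c in codes)


def build_tounicode(code_to_glyph):
    """Rebuild the missing code-to-Unicode table from the glyph names."""
    return {code: (" " if name == "space" else name)
            for code, name in code_to_glyph.items()}


def repaired_copy(codes, tounicode):
    """Copy/paste once the table has been restored."""
    return "".join(tounicode[c] for c in codes)


if __name__ == "__main__":
    char_to_code, code_to_glyph = make_obfuscated_font(seed=1)
    stored = encode("hello world", char_to_code)
    print(repr(on_screen(stored, code_to_glyph)))
    print(repr(naive_copy(stored)))
    print(repr(repaired_copy(stored, build_tounicode(code_to_glyph))))
```

In the toy the glyph names give the answer away. In a real obfuscated document the names may be missing or meaningless, which is presumably why the repair took weeks rather than minutes.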
philo said: That has nothing to do with the OP's question
JJ said: That's not the half of it. []
IOTW, PDF is a digital printed document.
I shared your reaction to philo's rude reply!
I've been using an application called PDF to Word Doc Converter. Here's a link to their site: http://www.hellopdf.com/
Paul said: Try converting this doc.
This is doubly protected.
http://www.2muslims.com/books/2discoverislam_com_riyad_us_saliheen.pdf
*******
Copying is disabled to start. So copy and paste will not work.
That can be fixed, by using a patched version of XPDF, to disable
the copy protection bit. (I'm sure there are a ton of other
ways to do that, but it's the first one I got working at the time.)
The second problem is the obfuscated font technique.
If you wipe over some text and paste into Notepad or Wordpad,
you get "lots of squares". Those are values below normal
printable ASCII (they're actually indexes into a table).
I'd be interested what these various free tools make of that test doc.
And whether they can do anything with it.
When someone brought this type of document to my attention some months ago,
my initial impression was that only some form of OCR technique
would make it copyable as text (copy/paste). That's what I suggested,
as the person posting the question was in a hurry. I'm curious
whether any of these free converters, just "give up" on straightforward
conversion, and just OCR all of them...
Paul
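A converter that wants to "give up and OCR" needs some way to notice that the copied text is junk. Here is a rough sketch of one such check in Python, assuming the text has already been pulled out of the PDF by some other tool. The function names and the 0.3 threshold are arbitrary choices for illustration, not taken from any real converter.

```python
def control_char_ratio(text):
    """Fraction of characters below printable ASCII, ignoring tab/LF/CR."""
    if not text:
        return 0.0
    allowed = "\t\n\r"
    bad = sum(1 for ch in text if ord(ch) < 32 and ch not in allowed)
    return bad / len(text)


def needs_ocr(extracted_text, threshold=0.3):
    """Guess whether extracted text is unusable and OCR should be tried.

    True when there is no text at all (image-only pages), or when
    too much of it is control characters (obfuscated font codes).
    """
    if not extracted_text.strip():
        return True
    return control_char_ratio(extracted_text) > threshold


if __name__ == "__main__":
    for sample in ["", "Hello, world.\n", "\x03\x11\x07\x07\x12"]:
        print(repr(sample), needs_ocr(sample))
```

This only catches the "lots of squares" case. An obfuscation that remaps codes onto other printable letters would slip past it and need something smarter, such as a dictionary check.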
philo <[email protected]> said: I found Paul's reply to be rude actually...and yours as well.
Paul is rarely rude - less so than I, I think.
philo said: Paul normally gives excellent advice but the OP specifically said he
wanted to convert a PDF document to Word.
He said /nothing/ about it being a scanned fax.
Since fax machines are rarely used anymore, it did not seem reasonable
to make the assumption that that's what the OP had in mind.
For that, obviously one would need OCR software.
OK, Paul was a little lazy in using "fax" to mean "scanner". I very much
doubt the example he quoted above (data sheet from Linear Technologies)
was actually made on a fax machine.
The fact remains that a fair proportion - I'd say 5 to 10 per cent, but
it will vary a lot by context - of the PDFs around _are_ just scanned
images (sometimes in colour, so obviously _not_ done on a fax machine).
Your "just cut and paste" [you meant copy and paste (-:] was probably
meant to be helpful; Paul's tendency to give complete replies meant he
couldn't let the OP think that that would work in all circumstances.
"Flash", if you're still with us: have you managed to get your "2-4
documents" into Word? If so, what did you use? (What were/are they?)
philo <[email protected]> said: On 08/03/2013 08:28 AM, J. P. Gilliver (John) wrote: []
In short, when someone posts a question, unless they state otherwise...
I just take it in its most basic terms. I think Paul made an
assumption that just plain was not there.
The OP simply said "PDF" and nothing more.
Well, you made one about the nature of the .pdf files! Quite possibly a
valid one. But not everybody - possibly including the OP - _realises_
that some PDFs are just scanned images.
philo said: At any rate even a scanned image can be cut and pasted into Word...
it will not be editable text of course.
True!
philo said: Anyway I'm sure the OP has well been scared off of Usenet by now
Well, I hope not usenet, but I'd be surprised if he's still reading this
philo <[email protected]> said: <snip>
yep I made an assumption
OTOH: if someone asks me whats one plus one...
I am going to make the assumption they are using base 10
What base is your "10" in?