Puzzling errors copying PDF text

Terry Pinnell

This is admittedly OT, as I strongly doubt if it's an XP problem. But
maybe someone here has encountered similar odd behaviour and can give me a
pointer as to its cause please.

A friend sent me a trip report as a PDF. (Not sure why he chose that
instead of plain text, because text is its only content.) When I selected
it with the Text tool in PDF Viewer and pasted it into my text editor,
every occurrence of 'fi' and 'fl' had been replaced with '?'. No other
errors, just those.

Any ideas please? It's more the sort of thing I'd expect from OCR.
 
Terry Pinnell

David H. Lipman said:
That's probably what it was, software did OCR on a scanned object.

Look in the PDF Properties and see what created the PDF.

Thanks. My friend's husband apparently typed it with what Document
Properties describes as 'iPhone OS 6.1.3 Quartz PDFContext'.
 
Paul

It's possible to protect documents against Copy & Paste, using a
font mangling technique. It was first suggested here.

http://spivey.oriel.ox.ac.uk/corner/Obfuscated_PDF

Your problem is just a "regular" problem with PDF, rather than
the willful damage ("protection") afforded by some tool flows.

I repaired such an Obfuscated document, just to see if it could
be done, and it took two weeks of scripting and experiments to do it,
and get the resulting document close to its original form (so that when
you copied and pasted, the text matched what was on the screen).
XPDF was indispensable, as was a font editing program (to repair
each font table). Even with OCR to fix the font tables, I doubt
such a repair could be completely automated. So if you want
a purposefully obfuscated document fixed, it's a *lot* of work.

*******

As to the nature of your problem, fi and fl are ligatures.

http://en.wikipedia.org/wiki/Typographic_ligature

That article even has examples of your two character sequences.

http://en.wikipedia.org/wiki/File:Ligature_drawing.svg

The fix is pretty simple. It's just a matter of recognizing the
Unicode for those things, and translating it back to text.
Try copying and pasting into a word processing environment, to
see if that's a more compliant transport than pasting
into Notepad. Then do a "find and replace", using the
ligature character for the find and "fi" for the replace, etc.
That will translate the one 16-bit Unicode character into two
ordinary 8-bit characters.
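
If a word processor isn't handy, the same find and replace can be sketched in a few lines of Python. This is only an illustration, not a tested fix for that particular PDF: the function name and the sample string are made up, and the last two lines rest on the assumption that the '?' shows up because the text gets squeezed into a legacy Windows code page (cp1252 here) somewhere between the viewer and the editor.

```python
# The two ligature code points, mapped back to plain letters.
LIGATURES = {
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",  # LATIN SMALL LIGATURE FL
}

def expand_ligatures(text: str) -> str:
    """Replace each fi/fl ligature character with its two-letter spelling."""
    return text.translate(str.maketrans(LIGATURES))

# Hypothetical sample text containing both ligatures.
sample = "The \ufb01rst \ufb02ight was \ufb01ne."
print(expand_ligatures(sample))  # The first flight was fine.

# Assumption: a lossy conversion to a legacy code page is what produces
# the '?'. cp1252 has no slot for U+FB01, so "replace" substitutes '?'.
print("\ufb01".encode("cp1252", errors="replace"))  # b'?'
```

Python's unicodedata.normalize("NFKC", text) would also expand these two ligatures, but it rewrites other compatibility characters as well, so the explicit table is the more conservative choice.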

Further down the page in the ligature article, they have
the mapping...

fi ligature = U+FB01
fl ligature = U+FB02

I wonder if pasting into a hex editor would work? Then
translate FB01 hex into the hex for the appropriate
two ASCII characters (66 69 for "fi").
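
For what it's worth, the bytes you'd see in a hex editor depend on how the text was saved. Here is a rough sketch of the byte-level version, assuming the pasted text ends up in a UTF-8 file (in UTF-16 the byte pairs would be different and would have to be matched on two-byte boundaries). The function name and sample text are hypothetical.

```python
# Byte sequences for the two ligatures when the text is stored as UTF-8.
FI_UTF8 = "\ufb01".encode("utf-8")  # EF AC 81
FL_UTF8 = "\ufb02".encode("utf-8")  # EF AC 82

def patch_utf8_bytes(data: bytes) -> bytes:
    """Swap the ligature byte sequences for the plain ASCII letters."""
    return data.replace(FI_UTF8, b"fi").replace(FL_UTF8, b"fl")

# Hypothetical sample: "office floor" typed with both ligatures.
raw = "of\ufb01ce \ufb02oor".encode("utf-8")
print(raw.hex(" "))                    # what a hex editor would show
print(patch_utf8_bytes(raw).hex(" "))  # after the replacement
print(patch_utf8_bytes(raw).decode("ascii"))
```

A plain byte replace is safe in UTF-8 because those three-byte sequences can't turn up in the middle of some other character.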

I love projects like this. That's what makes computers fun.
Having to do all the work manually :)

Have fun,
Paul
 
Terry Pinnell

Thanks a bunch, Paul, lots more to this issue than I thought!

(But I wish he'd just typed their travel report straight into an email!)
 
Paul

I tried a few experiments here, and the results weren't
very encouraging.

There is a wealth of converter applications out there,
but I imagine you could waste a lot of time testing them,
with no guarantee they can handle the file.

Converting the file to an image, then using OCR on it,
is bound to yield even more problems. And since I've
"been there before", I'm not going near that with a
barge pole. I don't recommend that approach, unless
you're really desperate.

*******

I used LibreOffice as a test vehicle, inserted a ligature
as a special character, printed to PostScript, then tried
to filter that back to basic text. And I get the impression
that modern word processing tools are way too clever for the
methods we used back in 2004. That's why I don't feel
optimistic that just any old tool will work for this. LibreOffice
re-encoded the font, and my sample text was not present
as a "plain text" string inside the file it produced.
That's an example of how evil these tools have become.
The output still "looks" fine, it's just not amenable
to the old, simple-minded filter operations.

Paul
 
Terry Pinnell

Thanks for the follow-up, Paul, and sorry for my delay in acknowledging.

The task is definitely not worth that sort of effort!
 
