jim said:
On Sat, 10 Mar 2012 17:49:06 -0500, in
Wow, Paul, you did a lot here. I am passing all of this on to my friend.
Thanks,
jim
It turns out there is a simple substitution involved. (Simple in the
cryptography sense, not in the practical implementation sense.)
The idea of "damaging" documents in that way starts with the ideas here.
http://spivey.oriel.ox.ac.uk/corner/Obfuscated_PDF
I managed to fix the test document, using three tools and
a lot of hand-crafted scripts (for testing).
XPDF is a tool (typically seen on Linux platforms), and it has a
dandy function for converting PDF to PS. It checks the "copy"
permission of the document, and with no source modification, it
won't convert the test document. Mayayana mentions the "doc->okToCopy"
fix, and there are several places in the source where that check can be
commented out. On the one hand, the author of the program doesn't
want the function defeated, but on the other hand, the software
doesn't try to hide where the function is checked. After a recompile,
I could do conversions with it.
ps2ascii (part of Ghostscript) was originally written to handle dvi2ps
output. Dvi2ps was a conversion tool, noted for some pretty dreadful
looking copy when you were finished. I received a few manuals years ago
in dvi2ps format, and they were hard to read because of the fonts.
ps2ascii was written as a means of converting those PostScript documents
back into text. It was never intended to be a general purpose tool (to
handle any PS or PDF that comes along).
The third tool used was Fontforge. It's available in my Linux VM, and
that's all I could find for editing fonts.
I converted from PDF to PostScript, using XPDF and ps2ascii. In the case
of ps2ascii, the tool (set to COMPLEX mode in the .bat file) will output
font calls and text strings. Those, in turn, can be converted back into
PostScript font calls, and the text strings into "x y moveto (string) show"
type constructs. Basically, I can build a new PostScript file, using the
fonts and strings that ps2ascii emits.
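A minimal sketch of that rebuilding step, in Python rather than the
scripts actually used. The record layout (font name, size, x, y, text)
and the function names are assumptions for illustration, not the real
ps2ascii COMPLEX format, which would have to be parsed first.

```python
def ps_escape(text):
    """Escape the characters that are special inside a PostScript (...) string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def build_page(records):
    """Turn (font_name, size, x, y, text) tuples into one PostScript page.

    The tuple layout is hypothetical; it stands in for whatever was
    parsed out of the ps2ascii output.
    """
    lines = ["%!PS"]
    current = None
    for font_name, size, x, y, text in records:
        # Only emit a font call when the font or size actually changes.
        if (font_name, size) != current:
            lines.append(f"/{font_name} findfont {size:g} scalefont setfont")
            current = (font_name, size)
        lines.append(f"{x:g} {y:g} moveto ({ps_escape(text)}) show")
    lines.append("showpage")
    return "\n".join(lines) + "\n"
```

The font definitions themselves would still have to be placed ahead of
this in the output file, so that findfont can locate them.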
Using the XPDF output, I get 15 subsetted fonts. You have to put those
into separate files, and then Fontforge can read all of the fonts.
(Initially, the fonts wouldn't read, when there were too many in one file.)
The fonts need to have the letters moved around in the boxes, as
part of the repair. On some of the fonts, this required selecting
"ReEncode" to create a new encoding. On others, the encoding was close
enough that moving a few letters did the job. This is an example
of a font, after the letters have been "put back in the right holes".
http://img171.imageshack.us/img171/463/fontforge.gif
Using the same fonts, you look at how the positions of the characters
in the font differ from their "normal" positions. Comparing Courier
font (regular encoding) to the jumbled font, I can make translation
tables. Creating a small test PostScript file, and using the ( ) show
command 256 times, dumps a 16x16 table of letters for examination.
http://img254.imageshack.us/img254/8181/fonttable.gif
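A test file of that kind could be generated along these lines. This is
only a sketch: the function names, cell spacing and page coordinates are
made up, and each character code is written as an octal escape so that
all 256 values are safe inside a PostScript string.

```python
def ps_code_string(code):
    """One-character PostScript string for a byte value, as an octal escape."""
    return f"(\\{code:03o})"


def font_table_ps(font_name, size=18, left=72, top=720, cell=30):
    """Emit PostScript that shows codes 0..255 of a font in a 16x16 grid."""
    lines = ["%!PS", f"/{font_name} findfont {size} scalefont setfont"]
    for code in range(256):
        row, col = divmod(code, 16)   # 16 codes per row
        x = left + col * cell
        y = top - row * cell          # rows run down the page
        lines.append(f"{x} {y} moveto {ps_code_string(code)} show")
    lines.append("showpage")
    return "\n".join(lines) + "\n"
```

Running it once for Courier and once for a jumbled font gives two grids
that can be compared cell by cell. For a subsetted font, the font
definition has to be included in the file ahead of these lines.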
Then, I can make my translation tables.
z="SZNMKL\+MSTT31c55100" # /F1147_0
font[z,"\$"] = "A"
font[z,"\%"] = "B"
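Applying tables like that is a per-font character lookup. A rough Python
equivalent, using only the two entries shown above (the function name
and the "?" marker for unmapped characters are my own choices):

```python
# One dictionary per font; only the two entries from the post are filled in.
FONT_MAP = {
    "SZNMKL+MSTT31c55100": {"$": "A", "%": "B"},
}


def translate(font_name, text, table=FONT_MAP, unknown="?"):
    """Map each jumbled character to its real letter for the given font."""
    mapping = table.get(font_name, {})
    # Unmapped characters become a visible marker instead of passing through,
    # so gaps in the table show up in the output.
    return "".join(mapping.get(ch, unknown) for ch in text)
```

With about 300 entries spread over the fifteen fonts, the same lookup
covers the whole document.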
It turned out that ps2ascii gave output I could convert
easily with a script. But ps2ascii doesn't capture the
font sizes correctly. About four of the fifteen fonts caused
a problem, and adjusting the font size by hand wasn't going to
work either. I tried some simple approaches first, but the
quality wasn't there. (The idea is to make all the colored
text sections not overlap, as a metric for quality.)
So then I took the font sizes for the characters, as captured
in the XPDF output, and made a roughly two million entry table from
them. Then I had the program that reads the ps2ascii output also read
in the two million entry table (one entry per letter). From that,
I could associate the correct font size with each letter. (See a
letter in the ps2ascii output, verify the same letter was next in the
XPDF table, copy the font size from the XPDF table, and get some
better font size info for output.)
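The lockstep check in that parenthesis can be sketched like this. The
data shapes are assumptions (a sequence of letters on one side, a list
of (letter, size) pairs on the other), and the real table was built
from the XPDF PostScript output, which this sketch does not parse.

```python
def attach_sizes(letters, size_table):
    """Pair each ps2ascii letter with the size from the XPDF-derived table.

    letters:    characters in document order (hypothetical input shape)
    size_table: (character, font_size) pairs in the same order
    Raises ValueError as soon as the two streams disagree.
    """
    result = []
    for index, ch in enumerate(letters):
        if index >= len(size_table):
            raise ValueError(f"size table ran out at letter {index}")
        ref_ch, size = size_table[index]
        # Verify the same letter is next in the table before trusting its size.
        if ref_ch != ch:
            raise ValueError(
                f"mismatch at letter {index}: got {ch!r}, table has {ref_ch!r}")
        result.append((ch, size))
    return result
```

Stopping at the first mismatch is deliberate: once the two streams
drift apart, every later font size would be wrong.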
With the letters translated, and the font table entries pushed around,
this is what the final output looks like. If you drag the mouse over
any of the text, and do a copy/paste, the text is pretty good. (Occasionally,
a "space" character goes missing, as in the example text below. Also in that
sample, the handling of the hyphen was screwed up when copy/pasted.) But if
you zoom into this picture, and look at the letters in the font, something
happened to this particular font when I re-generated it. There is
something wrong with the baseline of the individual characters.
(The left and center legs of the small "m" are lifted a bit, making
the "m" look tilted.) It would appear that when I used copy/paste
within Fontforge, some auxiliary information didn't get copied at
the same time. Getting details like that correct is the reason you'd
use a program like that in the first place. In any case, you
could still read the document in its current form; it just doesn't
look its best.
http://img189.imageshack.us/img189/6286/samplepdf.gif
(And seven lines, copy/paste from the third paragraph.)
6. Abu Ishaq Sa‘d bin Abu Waqqas (May Allah be pleased with him) (one of the ten who had been
given the glad tidings of entry into Jannah) narrated: Messenger of Allah (PBUH) visited me in my
illness which became severe in the year of Hajjat-ul-Wada‘ (Farewell Pilgrimage). I said, "O
Messenger of Allah, you can see the pain which I am suffering and I am a man of means and there is
none to inherit from me except one daughter. Should I give two-thirds of my property in charity?’’ He
(PBUH) said, "No". I asked him, "Then half?’’ He said, "No". Then I asked, "Can I give away onethird".
He said, "Give away one-third, and that is still too much. It is better to leave your heirs well-off
The method I used can't be turned into general purpose software,
because the Fontforge step is effectively "human OCR". It reduces
the error rate, in the sense that I only have to get about 300
character translations right to get the ~2 million character
document right. But it's still a step where software couldn't
help. The thing is, the obfuscation method breaks the binding of
a "tag" to a set of drawing commands, and without a means of
identifying what a given set of font drawing commands draws, you can't
figure it out. It would take an OCR algorithm, driving the Fontforge
step, to automate this.
Printing the document to TIFF format files, one per page, and doing
OCR on that, will lead to a lot of character errors and spacing errors
as well. The translation I did still needs a lot of work. In particular,
there are a number of purposeful typos in the document, like the use of
two '' instead of a " , as in the sample above. One typo was so purposeful,
they replaced the letter "i" with a completely different font (introduced
only to draw that one letter), and doing a font change in the middle of a
word is not something that happens from "fat fingers". It requires
forethought.
Paul