Replacing a string inside of a PDF

Josh Baltzell

I am having a lot more trouble with this than I thought I would. Here
is what I want to do in pseudocode.

Open c:\some.pdf
Replace "Replace this" with "Replaced!"
Save c:\some_edited.pdf

I can do this in Notepad and it works fine, but when I start reading
the files in code I think I hit some encoding problem. I tried
saving the file with every encoding option. When I open a PDF in the
text editor I normally use, it says it is ANSI with Mac-style carriage
returns. WinMerge will not let me compare the files because it says
they are binary.

Anyone know what I have to do?
 
Samuel Shulman

Please explain: are you trying to read the file into a binary string and
then write another file from that binary string?
 
Josh Baltzell

Samuel,

I have tried it several ways. The end goal is just to end up with an
edited PDF. If I have to overwrite the original file that is fine.
 
Josh Baltzell

I'm assuming that I should somehow be using a BinaryReader and a
BinaryWriter; I just don't know how to work with the data inside as
strings and then put it back into the writer.
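
For reference, a minimal sketch of reading and writing a file with a BinaryReader and BinaryWriter so that no text decoding touches the bytes. The module and function names (BinaryCopySketch, ReadAllPdfBytes, WriteAllPdfBytes) are made up for illustration and are not from the thread; the sketch assumes the file is small enough to hold in memory.

```vb
Imports System.IO

Module BinaryCopySketch
    ' Reads every byte of the file without applying any text encoding.
    Function ReadAllPdfBytes(ByVal filePath As String) As Byte()
        Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read)
            Using reader As New BinaryReader(fs)
                Return reader.ReadBytes(CInt(fs.Length))
            End Using
        End Using
    End Function

    ' Writes the bytes back exactly as given, replacing any existing file.
    Sub WriteAllPdfBytes(ByVal filePath As String, ByVal data As Byte())
        Using fs As New FileStream(filePath, FileMode.Create, FileAccess.Write)
            Using writer As New BinaryWriter(fs)
                writer.Write(data)
            End Using
        End Using
    End Sub
End Module
```

Reading and then writing this way should produce a byte-for-byte copy; the open question is how to edit the bytes in between.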
 
Samuel Shulman

I think that the key to your question is how to actually read the file (I
should have realized before that this is the main issue).

Did you manage to read parts of the file? Only if you can do that can you
replace the text.
 
Josh Baltzell

I have written the code to at least read the internals of the file as a
string or a stream, and then I can find the chunk I want to replace easily
enough, but I think it loses some special characters, or maybe screws
up the line endings (I believe these PDF files use Mac-style CR only,
instead of CR LF like a lot of Windows-based files.)

So I guess my problem is actually reading and writing. I can write
code that looks like it is reading the file with a StreamReader, but I
think I am really losing data. I can write code that reads it as binary,
but then I have trouble working with the contents. After all that is
worked out I have to figure out how to write the edited file back to
disk (I believe the BinaryWriter will do that, but I have not tested
much.)

I'm not sure what else I can tell you. This is just a matter of me not
fully understanding how I am supposed to read and edit a file like this,
as opposed to the other formats that I have worked with, which were all
plain text.

Thanks a lot for the feedback. I looked at the other post you linked
to and read the linked page. I think that would be useful to me if the
PDFs were compressed, but I can open these in Notepad and find my
string right now (and that works when I do the edit that way.)
 
Samuel Shulman

You may be able to create a string identical to the one that you want to
replace, then convert it to a binary stream (it doesn't have to be a file).
Then look for that binary sequence within the main binary stream (the binary
buffer that holds the PDF file) and replace it with another binary stream
created from the string you want to use as the replacement.
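
The idea above can be sketched roughly as follows. This is only an illustration under assumptions: the names (ByteReplaceSketch, ReplaceBytes, ReplaceInFile) are hypothetical, the search text is assumed to be plain ASCII, and the whole file is assumed to fit in memory.

```vb
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module ByteReplaceSketch
    ' Returns True when pattern occurs in data starting at index start.
    Private Function MatchesAt(ByVal data As Byte(), ByVal start As Integer, ByVal pattern As Byte()) As Boolean
        If start + pattern.Length > data.Length Then Return False
        For i As Integer = 0 To pattern.Length - 1
            If data(start + i) <> pattern(i) Then Return False
        Next
        Return True
    End Function

    ' Replaces every occurrence of oldBytes with newBytes; all other bytes are copied unchanged.
    Function ReplaceBytes(ByVal data As Byte(), ByVal oldBytes As Byte(), ByVal newBytes As Byte()) As Byte()
        If oldBytes.Length = 0 Then Return data
        Dim result As New List(Of Byte)(data.Length)
        Dim pos As Integer = 0
        While pos < data.Length
            If MatchesAt(data, pos, oldBytes) Then
                result.AddRange(newBytes)
                pos += oldBytes.Length
            Else
                result.Add(data(pos))
                pos += 1
            End If
        End While
        Return result.ToArray()
    End Function

    ' Hypothetical wrapper: the search and replacement text are converted with ASCII (an assumption).
    Sub ReplaceInFile(ByVal inPath As String, ByVal outPath As String, ByVal oldText As String, ByVal newText As String)
        Dim data As Byte() = File.ReadAllBytes(inPath)
        Dim edited As Byte() = ReplaceBytes(data, Encoding.ASCII.GetBytes(oldText), Encoding.ASCII.GetBytes(newText))
        File.WriteAllBytes(outPath, edited)
    End Sub
End Module
```

Because the file is never decoded as text, bytes outside the matched sequence cannot be altered. One caveat: PDF cross-reference tables store byte offsets, so a replacement of a different length may shift them; same-length strings such as "ABC1234567890" and "ABC1111111111" avoid that.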


You still have the problem of the funny characters, which you can imitate by
adding CR instead of CRLF (or whatever the normal line ending is).

And finally, once the code works, please send it over; it seems
interesting to me (if it is OK with you/your company).

Regards,
Samuel
 
Josh Baltzell

I'm not sure I know how to do what you are saying, but here is a test I
made to write the file using a string converted into a byte array.
This is not working.

:::::::::::::::::::::::::::::::::::::::::::::::::::
Public Function ByteTest()
    Dim PDFFile As String
    Dim PDFFolder As IO.Directory

    Response.Write("Start Byte:" & DateTime.Now.ToLongTimeString & ":" & Now.Millisecond & "<br>")

    For Each PDFFile In PDFFolder.GetFiles(Server.MapPath("PDF"))
        'Open the file
        Dim FileStream As IO.StreamReader
        FileStream = IO.File.OpenText(PDFFile)

        'Load the file in to a string
        Dim Contents As String = FileStream.ReadToEnd

        'Replace text in string
        Contents = Contents.Replace("ABC1234567890", "ABC1111111111")

        'Close stream
        FileStream.Close()

        'Create byte based output file
        Dim OutputFileName As String = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "BYTE.pdf")
        Dim fs As FileStream = File.Create(OutputFileName)
        fs.Close()

        'Convert the string to bytes
        Dim info As Byte() = New System.Text.UTF8Encoding(True).GetBytes(Contents)

        'Write string as bytes to output file
        fs = File.OpenWrite(OutputFileName)
        fs.Write(info, 0, info.Length)
        fs.Close()

    Next

    Response.Write("Stop Byte:" & DateTime.Now.ToLongTimeString & ":" & Now.Millisecond & "<br>")

End Function
:::::::::::::::::::::::::::::::::::::::::::::::::::
 
Josh Baltzell

Here is another test I wrote that successfully generates a bunch of
useless files encoded in different ways.

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Public Function StringTest()
    Dim PDFFile As String
    Dim PDFFolder As IO.Directory

    Response.Write("Start String:" & DateTime.Now.ToLongTimeString & ":" & Now.Millisecond & "<br>")

    For Each PDFFile In PDFFolder.GetFiles(Server.MapPath("PDF"))
        'Open the file
        Dim FileStream As IO.StreamReader
        FileStream = IO.File.OpenText(PDFFile)

        'Load the file in to a string
        Dim Contents As String = FileStream.ReadToEnd

        'Replace text in string
        Contents = Contents.Replace("ABC1234567890", "ABC1111111111")

        'Close stream
        FileStream.Close()

        'Create ASCII output file
        Dim OutputFileName As String = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-ASCII.pdf")
        Dim fs As FileStream = File.Create(OutputFileName)
        Dim PDFStream As StreamWriter = New StreamWriter(fs, System.Text.Encoding.ASCII)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create BigEndianUnicode output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-BigEndianUnicode.pdf")
        fs = File.Create(OutputFileName)
        PDFStream = New StreamWriter(fs, System.Text.Encoding.BigEndianUnicode)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create default formatted output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-Default.pdf")
        fs = File.Create(OutputFileName)
        PDFStream = New StreamWriter(fs, System.Text.Encoding.Default)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create Unicode output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-Unicode.pdf")
        fs = File.Create(OutputFileName)
        PDFStream = New StreamWriter(fs, System.Text.Encoding.Unicode)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create UTF7 output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-UTF7.pdf")
        fs = File.Create(OutputFileName)
        PDFStream = New StreamWriter(fs, System.Text.Encoding.UTF7)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

        'Create UTF8 output file
        OutputFileName = Server.MapPath("PDFOutput\" & DateTime.Now.ToFileTimeUtc.ToString & "STRING-UTF8.pdf")
        fs = File.Create(OutputFileName)
        PDFStream = New StreamWriter(fs, System.Text.Encoding.UTF8)
        PDFStream.Write(Contents)
        PDFStream.Close()
        fs.Close()

    Next

    Response.Write("Stop String:" & DateTime.Now.ToLongTimeString & ":" & Now.Millisecond & "<br>")

End Function
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 
Branco Medeiros

Josh said:
I am having a lot more trouble with this than I thought I would. Here
is what I want to do in pseudocode.

Open c:\some.pdf
Replace "Replace this" with "Replaced!"
Save c:\some_edited.pdf

I can do this in Notepad and it works fine, but when I start reading
the files in code I think I hit some encoding problem. I tried
saving the file with every encoding option. When I open a PDF in the
text editor I normally use, it says it is ANSI with Mac-style carriage
returns. WinMerge will not let me compare the files because it says
they are binary.
<snip>

WinMerge is right: a PDF file is actually a binary image, not plain
text in a given encoding. You should load it as a stream of bytes.

On the other hand, since you want to perform text replacements in the
file, you may load it with an encoding that doesn't apply
transformations to the bytes in the file, such as the ANSI encoding:

Sub PDFReplaceText(ByVal Path As String, ByVal OldText As String, _
                   ByVal OutPath As String, ByVal NewText As String)

    Const ANSI As Integer = 1252

    Dim Encoding As Text.Encoding = Text.Encoding.GetEncoding(ANSI)
    Dim sr As New IO.StreamReader(Path, Encoding)
    Dim Data As String = sr.ReadToEnd
    sr.Close()

    Data = Data.Replace(OldText, NewText)

    Dim sw As New IO.StreamWriter(OutPath, False, Encoding)
    sw.Write(Data)
    sw.Close()

End Sub
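
Whether a given encoding really leaves the bytes alone can be checked directly. The following is a minimal sketch, not part of the code above: the RoundTrips helper and its module name are hypothetical. It decodes all 256 byte values to a string, encodes them back, and reports whether every byte survived.

```vb
Imports System.Text

Module EncodingRoundTripSketch
    ' Decodes all 256 byte values with enc, encodes the result back,
    ' and returns True only if every byte comes back unchanged.
    Function RoundTrips(ByVal enc As Encoding) As Boolean
        Dim original(255) As Byte
        For i As Integer = 0 To 255
            original(i) = CByte(i)
        Next

        Dim decoded As String = enc.GetString(original)
        Dim back As Byte() = enc.GetBytes(decoded)

        If back.Length <> original.Length Then Return False
        For i As Integer = 0 To 255
            If back(i) <> original(i) Then Return False
        Next
        Return True
    End Function
End Module
```

Calling RoundTrips(Text.Encoding.GetEncoding(1252)) on the target runtime shows whether the ANSI code page is safe there. UTF-8 and ASCII fail this check because bytes above 127 are not valid on their own in those encodings, which is consistent with the data loss seen when the file is read with the default StreamReader. On newer .NET runtimes, code page 1252 may need to be registered through the code pages encoding provider before GetEncoding can find it.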

HTH.

Regards,

Branco.
 
Josh

Branco,

This worked perfectly. My knowledge of the encoding options in general
is very weak, so thanks for spelling it out for me with some code.

Samuel,

Thank you to you too. You have both been a big help.

Thank you,
Josh Baltzell
 
Josh

I put the encoding options into my own code, so I am not positive.
This is the final sub I ended up with.

Public Sub ReplaceText(ByVal FilePath As String, ByVal OriginalText As String, _
                       ByVal NewText As String)
    'Code page 1252 (ANSI) is used for both reading and writing
    Dim Encoding As System.Text.Encoding = System.Text.Encoding.GetEncoding(1252)

    'Open the file
    Dim FileStream As New IO.StreamReader(FilePath, Encoding)

    'Load the file in to a string
    Dim Contents As String = FileStream.ReadToEnd

    'Replace text in string
    Contents = Contents.Replace(OriginalText, NewText)

    'Close stream
    FileStream.Close()

    'Write the string back over the original file using the same encoding
    Dim OutputFileName As String = FilePath
    Dim sw As New IO.StreamWriter(OutputFileName, False, Encoding)
    sw.Write(Contents)
    sw.Close()

End Sub
 
