Trouble reading Word documents

Co

Hi All,

I use code that creates a FileStream to open and read the contents of
a Word document.
I want to save the text as plain text to a database.
I have code that decodes the file as UTF-8, but that doesn't always
work:

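' Requires: Imports System.IO and Imports System.Text
Sub readdoc2(ByVal sPathName As String)

    Dim temp As UTF8Encoding = New UTF8Encoding(True)
    Dim fs As FileStream = File.OpenRead(sPathName)
    Dim b(1023) As Byte 'upper bound is inclusive, so this is 1024 bytes
    Dim n As Integer = fs.Read(b, 0, b.Length)

    Do While n > 0
        'decode only the n bytes actually read, not the whole buffer
        '(a multi-byte character split across two reads is still garbled)
        Me.RichTextBox1.Text &= temp.GetString(b, 0, n)
        n = fs.Read(b, 0, b.Length)
    Loop

    fs.Close()

End Sub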

Some documents need my other code:

Sub readdoc(ByVal sPathName As String)

    Dim fs As FileStream = File.OpenRead(sPathName)
    'create a new StreamReader, passing the FileStream object fs as argument
    Dim d As New StreamReader(fs)

    'Seek moves the position in the file; here, to the beginning
    d.BaseStream.Seek(0, SeekOrigin.Begin)

    While d.Peek() > -1
        'Peek returns the next character without consuming it,
        'or -1 when there is nothing left to read
        Me.RichTextBox1.Text &= d.ReadLine()
    End While
    d.Close()

End Sub

Either way, I end up with some strange characters that I first have to
remove before I can save the text to the database.

Is there a way to get the text from a document without having to
remove these unreadable characters?

Regards
Marco
The Netherlands
 
Armin Zingler

Co said:
Hi All,

I use a code that creates a FileStream to open and read the content of
a word document.
I want to save the text as plain text to a database.
Now I have a code that reads UTF-8 encoding but that doesn't always
work:

You cannot handle a .doc file like a plain text file. It's stored in a
(proprietary) binary format.

If you really have a lot of time to read:
http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx


Armin
 
Dale Atkin

You can not handle a .doc file like a plain text file. It's stored in a
(proprietary) binary format.

If you have really a lot of time to read:
http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx

The new Office formats (e.g. .docx, .pptx, etc.) are actually zip files. Inside
the zip file, you'll find a bunch of XML files. word/document.xml contains
the text data for the file (along with some other markup you'll need to parse
out).
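
To give an idea of what the "parse out" step might look like, here is a minimal sketch that should work on .NET 2.0. It assumes you already have the contents of word/document.xml as a stream. The module name DocxText and the function ExtractText are made up for illustration. It only collects the w:t text elements of each w:p paragraph, so tabs, manual line breaks, and paragraphs nested inside text boxes are not handled properly.

```vbnet
Imports System.IO
Imports System.Text
Imports System.Xml

Module DocxText

    ' WordprocessingML main namespace used by word/document.xml.
    Private Const WordNs As String = _
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Hypothetical helper: takes the contents of word/document.xml as a
    ' stream and returns the text, one line per paragraph.
    Function ExtractText(ByVal documentXml As Stream) As String
        Dim xml As New XmlDocument()
        xml.Load(documentXml)

        Dim ns As New XmlNamespaceManager(xml.NameTable)
        ns.AddNamespace("w", WordNs)

        Dim sb As New StringBuilder()
        ' Each w:p is a paragraph; its visible text sits in w:t elements.
        For Each para As XmlNode In xml.SelectNodes("//w:p", ns)
            For Each run As XmlNode In para.SelectNodes(".//w:t", ns)
                sb.Append(run.InnerText)
            Next
            sb.AppendLine()
        Next
        Return sb.ToString()
    End Function

End Module
```

The result could then be assigned to Me.RichTextBox1.Text or written to the database as plain text.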

Don't know if this is useful to your particular situation or not...

Dale
 
Number Eleven - GPEMC!

Dale Atkin said:
The new office format (ie. .docx, pptx, etc) is actually a zip file. Inside
the zip file, you'll find a bunch of xml files. \word\document.xml contains
the text data for the file (along with some other stuff you'll need to parse
out).

Don't know if this is useful to your particular situation or not...

Dale

Actually for me, that was enormously helpful.
Thanks Dale Atkin, Armin Zingler, and Co.

Is there a zip/unzip function in VB2005 that can be used to expose the XML
inside docx (etc.) formats...?

Thanks in Advance...

____________________________________________________________
Timothy Casey GPEMC - Eleven is the (e-mail address removed) to email.
Philosophical Essays: http://timothycasey.info
Speed Reading: http://speed-reading-comprehension.com
Software: http://fieldcraft.biz; Scientific IQ Test, Web Menus, Security.
Science & Geology: http://geologist-1011.com; http://geologist-1011.net
Technical & Web Design: http://web-design-1011.com
 
Tom Shelton

Actually for me, that was enormously helpful.
Thanks Dale Atkin, Armin Zingler, and Co.

Is there a zip/unzip function in VB2005 that can be used to expose the XML
inside docx (etc.) formats...?

Thanks in Advance...

I recommend SharpZipLib:
http://www.icsharpcode.net/OpenSource/SharpZipLib/

Unless you're using .NET 3.0 with VS2005, because then you can use
System.IO.Packaging. It has native support for the style of zip files that
Word uses.
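
As a rough sketch of the System.IO.Packaging route, assuming the project references WindowsBase.dll from .NET 3.0: the module name DocxPackage and the function ReadDocumentXml are hypothetical, and the part name /word/document.xml is taken from this thread rather than looked up through the package relationships, so a file that stores its main part elsewhere would fail here.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Packaging   ' needs a reference to WindowsBase.dll (.NET 3.0)

Module DocxPackage

    ' Hypothetical helper: returns the raw XML of the main document part.
    Function ReadDocumentXml(ByVal sPathName As String) As String
        ' Part name assumed from the thread; some files may store it elsewhere.
        Dim partUri As New Uri("/word/document.xml", UriKind.Relative)

        Using pkg As Package = Package.Open(sPathName, FileMode.Open, FileAccess.Read)
            Dim part As PackagePart = pkg.GetPart(partUri)
            Using reader As New StreamReader(part.GetStream(FileMode.Open, FileAccess.Read))
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

End Module
```

This only works for .docx files. An old binary .doc file is not a package, and Package.Open will throw an exception on it.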
 
Dale Atkin

What if I open Word, select all text and copy that to a string.
Then paste it into a richtextbox?

Is that an option for you? Sure, you could do that.

You might even be able to work out a way to script that, or you might be
able to code some kind of macro in VBA to do what you want, which would be
more efficient (do they still call it VBA?).
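
One way the scripting idea could look from VB itself is to automate Word through late binding, sketched below. This assumes Word is installed on the machine running the code. The module name WordAutomation and the function ReadWithWord are made up for illustration. Because Word does the parsing, this should work for the old binary .doc files as well.

```vbnet
Option Strict Off   ' late binding below requires this

Imports Microsoft.VisualBasic

Module WordAutomation

    ' Hypothetical helper: asks an installed copy of Word for the text.
    Function ReadWithWord(ByVal sPathName As String) As String
        Dim app As Object = CreateObject("Word.Application")
        Try
            Dim doc As Object = app.Documents.Open(sPathName)
            Try
                ' Content.Text is the main document text as one string.
                Return CStr(doc.Content.Text)
            Finally
                doc.Close(False)   ' close without saving
            End Try
        Finally
            app.Quit()
        End Try
    End Function

End Module
```

Word separates paragraphs with a carriage return (vbCr) in this string, so you may want to replace those with vbCrLf before showing the text in a RichTextBox. Starting Word for every file is slow, so this is probably only practical for small batches.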

Dale
 
Number Eleven - GPEMC!

Tom Shelton said:
I recommend SharpZipLib:
http://www.icsharpcode.net/OpenSource/SharpZipLib/

Unless your using .NET 3.0 in your VS2005 - because then you can use
System.IO.Packaging. It has native support for the style of zip files that
word is using.


Thank you - that's fantastic news for me...

 