Reading text out of Word docs


Keith G Hicks

I started working on a program to read text out of some well-organized Word
docs. I've done this sort of thing in VBA, but not quite this extensively, and
I'm not great with Word automation. I know enough to be dangerous. LOL. I
need to open the doc (got that part done), locate certain phrases that appear
in all of them, and then read some text after those phrases into variables so
I can post them to a SQL database. The part I'm struggling with is how to read
the doc. I'm not changing the docs in any way. They are deposited into a folder
on the network and I open and read them as they arrive. Setting up the
watcher for this is not a problem. I just need help reading the docs in
VB.NET.

Here's some of what I have so far:

oWord = CreateObject("Word.Application")
oWord.Visible = True
oDoc = oWord.Documents.Open("C:\SomeWordDoc.doc", , True)

Dim rng As Word.Range

With oWord.Selection
    .HomeKey(wdStory)
    rng = .Range
End With

rng.Find.Text = "Issue date::"
If rng.Find.Execute() Then
    'MsgBox("found")
    rng = oWord.Selection.Range
    rng.End = rng.Next(wdLine, 1).End ' rng.MoveEnd(wdLine)
    MsgBox(rng)
Else
    MsgBox("Not found")
End If

'move to line below "Issue Date:" to get county

Help with the above will really get me started well on this. I'd really
appreciate it.

Thanks,

Keith
 
Is this not possible?


 
Keith G Hicks said:
Is this not possible?

You probably should ask in an MS Word group:
microsoft.public.word.*

You might be using VB.Net but the code you're
working on is MS Word object model. It will only
make sense to people who use MS Word and who
have experience with MS Word/Office automation.
 
Keith G Hicks said:
Is this not possible?

I'm sure it is possible within Word, but I would grab all the text and
use regular expressions to search for the pattern you want. You seem to
be using Word only because that is the format of the original doc.
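A minimal sketch of that idea, assuming the whole document has already been read into a string and that Word separates paragraphs with a carriage return. The module name, function name, and sample text are made up for illustration; only the "Issue date::" phrase comes from the original post.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module LabelSearch
    ' Hypothetical helper: returns the text that follows `label` on the same
    ' line, or Nothing when the label does not occur in `wholeText`.
    Function TextAfterLabel(wholeText As String, label As String) As String
        ' Escape the label so it matches literally, skip spaces and tabs,
        ' then capture everything up to the next CR, LF, or vertical tab
        ' (Word uses a vertical tab for a manual line break).
        Dim pattern As String = Regex.Escape(label) & "[ \t]*([^\r\n\v]*)"
        Dim m As Match = Regex.Match(wholeText, pattern, RegexOptions.IgnoreCase)
        If Not m.Success Then Return Nothing
        Return m.Groups(1).Value.Trim()
    End Function

    Sub Main()
        ' Invented sample text; paragraph marks arrive as vbCr.
        Dim sample As String = "Notice" & vbCr & "Issue date:: 01/02/2010" & vbCr & "Some County"
        Console.WriteLine(TextAfterLabel(sample, "Issue date::"))   ' prints 01/02/2010
    End Sub
End Module
```

This only helps for values that sit on the same line as a fixed phrase; anything identified by position rather than by a label needs a different approach.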
 
The problem is that there is no Word VB.NET group, only VBA. And as we all
know, they are very different. I did post a note there asking people to look
at this post if they have any ideas, and so far nothing there either.
 
I did actually start trying that out yesterday. I'm reading the entire Word
doc into a string variable. I hadn't started the RegEx part, but I think
you're right. That's probably the best way to go. I was hoping, though, that
someone out there had a better, less brute-force way to do this.
 
Keith G Hicks said:
The problem is that there is no word vb.net group. Only vba. And as we all
know, they are very different.

Yes, that's what I meant. MS Office automation
is COM. You've got a COM object model, which is
adaptable to any COM-centric language. VB.Net is not
COM, so there's no direct translation. If it were me
I'd ask only in the Word group, get the VB/VBA code,
then figure out how to translate that to .Net. Even if
you were using a COM-centric language like VB or
VBScript, the Word group would still be the place
to ask, because your question is not about a language.
It's about the object model of the Word.Application
automation object.

Also, this may not help, but if you're dealing
only with .doc files (not .docx) and you're considering
just dealing with the text string as Family Tree Mike
suggested -- the .doc spec. has been published.
I think this is it:

http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Word97-2007BinaryFileFormat(doc)Specification.pdf

I downloaded it when it was first released and wrote
a VBScript to extract text from .doc files. It seems
to work quite dependably. The details of plain text
storage in .doc files (as opposed to formatting, images,
etc.) are not very complex.
 
Moving this post to word.vba.general.

Keith

 
I just gave this a try. It appears not difficult for a doc file--much
different for a docx file (which I didn't attempt).

From looking at the byte data of several files, I observed that

1. The body text starts at byte number 2562

2. The body text ends when you encounter the first 0 decimal value byte.

3. So simply read in the data between those two points.

I tried this on about six files. It worked for them. I can't guarantee
that it will work for all since I couldn't decipher in the Word file
documentation, for which someone posted the link, exactly where the text
began and its length. I simply looked at a few files.
 
It's somewhat more involved than that, but not too
bad. See here for a VBScript version:

http://www.jsware.net/jsware/scripts.php5#desk

You can pretty much see the text if you just open
a Word .doc in Notepad, but it needs to be cleaned up.
 
Are you guys just trying to show me how to read the doc as text? If that's
what you're trying to show me, that part's easy:

Dim oWord As Word.Application
Dim oDoc As Word.Document
oWord = CreateObject("Word.Application")
oWord.Visible = True
oDoc = oWord.Documents.Open("c:\SomeWordFile.doc", , True)
oWord.Selection.WholeStory()
Dim wholeText As String = oWord.Selection.Text

I was going to do that and use RegEx to find everything I need, but I got
answers on how to read the file as a Word doc (not as just text) in the
word.vba.general newsgroup. Reading this as text and using RegEx is a
problem because I can't use RegEx to find everything. I need to find
specific line numbers as well. I need all the info on line 4, and the info
on that line will vary to the point that RegEx would be impractical. Greg
Maxey in the other newsgroup gave me some sample code. I put it into .NET
and it got me going in the right direction.
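That sample code is not reproduced in this thread, so what follows is only a rough sketch of what an object-model approach might look like, not Greg Maxey's code. It assumes a project reference to Microsoft.Office.Interop.Word, that the wanted text sits in the same paragraph as the phrase, and that "line 4" corresponds to the fourth paragraph (Word's lines depend on page layout, paragraphs do not). The module and function names are hypothetical.

```vb
Imports System
Imports Word = Microsoft.Office.Interop.Word

Module WordFieldReader
    ' Hypothetical helper: text after `label`, up to the end of its paragraph.
    Function TextAfterLabel(doc As Word.Document, label As String) As String
        Dim rng As Word.Range = doc.Content            ' the main document body
        rng.Find.ClearFormatting()
        rng.Find.Text = label
        rng.Find.Forward = True
        rng.Find.Wrap = Word.WdFindWrap.wdFindStop     ' stop at the end, no wrap-around
        If Not rng.Find.Execute() Then Return Nothing
        ' When a Range-based Find succeeds, the range itself is redefined to
        ' span the match, so the Selection is not involved at all.
        Dim paragraphEnd As Integer = rng.Paragraphs(1).Range.End
        rng.SetRange(rng.End, paragraphEnd)
        Return If(rng.Text, String.Empty).Trim()
    End Function

    ' Hypothetical helper: text of the n-th paragraph (1-based).
    Function ParagraphText(doc As Word.Document, n As Integer) As String
        If n < 1 OrElse n > doc.Paragraphs.Count Then Return Nothing
        Return If(doc.Paragraphs(n).Range.Text, String.Empty).Trim()
    End Function

    Sub Main()
        Dim app As New Word.Application()
        Dim doc As Word.Document = Nothing
        Try
            ' Third argument opens the file read-only, as in the posts above.
            doc = app.Documents.Open("C:\SomeWordDoc.doc", , True)
            Console.WriteLine(TextAfterLabel(doc, "Issue date::"))
            Console.WriteLine(ParagraphText(doc, 4))
        Finally
            ' Close without saving and shut Word down so no hidden instance lingers.
            If doc IsNot Nothing Then doc.Close(False)
            app.Quit()
        End Try
    End Sub
End Module
```

Working on a Range taken from doc.Content keeps the search independent of where the cursor happens to be, which matters when Word runs unattended behind a folder watcher.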

Thanks.
 
In my three steps, I omitted a crucial 4th step. So my method for doc files
should read

1. The body text starts at byte number 2562

2. The body text ends when you encounter the first 0 decimal value byte.

3. So simply read in the data between those two points.

4. In that data only retain those bytes that are less than 123 and greater
than 31 along with line feeds and carriage returns.

That will give you the text and show where the line breaks are. No RegEx
is needed, as far as I can see, to identify the lines.
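The four steps above can be sketched roughly as follows. This is only the heuristic as described, not a parser for the published .doc format. The function name is hypothetical, and 2562 is treated as a zero-based offset, which is an assumption since the post does not say. Text that Word stored as 16-bit characters would make the loop stop almost immediately, because the second byte of each plain character is zero.

```vb
Imports System
Imports System.IO
Imports System.Text

Module DocBytesHeuristic
    ' Hypothetical helper implementing the four steps on an in-memory file image.
    Function ExtractBodyText(data As Byte(), Optional startOffset As Integer = 2562) As String
        Dim sb As New StringBuilder()
        ' Step 1: begin at the observed start of the body text.
        For i As Integer = startOffset To data.Length - 1
            Dim b As Byte = data(i)
            ' Step 2: the first zero byte marks the end of the body text.
            If b = 0 Then Exit For
            ' Step 4: keep bytes between 32 and 122, plus line feed and carriage return.
            Dim printable As Boolean = b > 31 AndAlso b < 123
            If printable OrElse b = 10 OrElse b = 13 Then
                sb.Append(Convert.ToChar(b))
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main(args As String())
        If args.Length = 0 Then
            Console.WriteLine("Pass the path of a .doc file as the first argument.")
            Return
        End If
        ' Step 3: read the file and filter the bytes between the two points.
        Console.WriteLine(ExtractBodyText(File.ReadAllBytes(args(0))))
    End Sub
End Module
```

Splitting the result on carriage returns then gives the lines by position, for example the fourth element for "line 4", provided the offset guess holds for the file in question.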

In your alternate VBA-style approach, you are using late binding. You might
want to switch to early binding as below, with a reference in your project
to the .NET assembly Microsoft.Office.Interop.Word, ver. 12. With that, you
can also read docx files.

The code below displays any Word file in a rich text box.

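Me.OpenFileDialog1.Title = "Select Word Document"
Me.OpenFileDialog1.FileName = ""
Me.OpenFileDialog1.Filter = "Word Doc (*.doc)|*.doc|Word docx (*.docx)|*.docx"
If Me.OpenFileDialog1.ShowDialog() = Windows.Forms.DialogResult.OK Then
    Dim docPath As String = Me.OpenFileDialog1.FileName

    Dim oWord As New Microsoft.Office.Interop.Word.Application
    ' Documents.Open returns the Document, so it is not created with New.
    ' The third argument opens the file read-only.
    Dim oDoc As Microsoft.Office.Interop.Word.Document = oWord.Documents.Open(docPath, , True)
    oWord.Selection.WholeStory()
    Me.rtbText.Text = oWord.Selection.Text

    ' Close the document without saving and shut down the hidden Word instance.
    oDoc.Close(False)
    oWord.Quit()
End If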
 