counting unique words in a document

O

O.Berkeley

I know that Word 2007 can count the number of words in a document, but is
there a way of adapting that ability to come up with a list of unique words
found in a file with the number of times each word appears?

Many thanks.
 
D

Doug Robbins - Word MVP

The following macro will do what you want:

Sub WordFrequency()

Dim SingleWord As String 'Raw word pulled from doc
Const maxwords = 9000 'Maximum unique words allowed
Dim Words(maxwords) As String 'Array to hold unique words
Dim Freq(maxwords) As Integer 'Frequency counter for Unique
Words
Dim WordNum As Integer 'Number of unique words
Dim ByFreq As Boolean 'Flag for sorting order
Dim ttlwds As Long 'Total words in the document
Dim Excludes As String 'Words to be excluded
Dim Found As Boolean 'Temporary flag
Dim j, k, l, Temp As Integer 'Temporary variables
Dim tword As String '

' Set up excluded words
' Excludes =
"[the][a][of][is][to][for][this][that][by][be][and][are]"
Excludes = ""
Excludes = InputBox$("Enter words that you wish to exclude,
surrounding each word with [ ].", "Excluded Words", "")
' Excludes = Excludes & InputBox$("The following words are excluded:
" & Excludes & ". Enter words that you wish to exclude, surrounding each
'word with [ ].", "Excluded Words", "")
' Find out how to sort
ByFreq = True
Ans = InputBox$("Sort by WORD or by FREQ?", "Sort order", "FREQ")
If Ans = "" Then End
If UCase(Ans) = "WORD" Then
ByFreq = False
End If

Selection.HomeKey Unit:=wdStory
System.Cursor = wdCursorWait
WordNum = 0
ttlwds = ActiveDocument.Words.Count
Totalwords = ActiveDocument.Words.Count

' Control the repeat
For Each aword In ActiveDocument.Words
SingleWord = Trim(aword)
If SingleWord < "A" Or SingleWord > "z" Then SingleWord = ""
'Out of range?
If InStr(Excludes, "[" & SingleWord & "]") Then SingleWord = ""
'On exclude list?
If Len(SingleWord) > 0 Then
Found = False
For j = 1 To WordNum
If Words(j) = SingleWord Then
Freq(j) = Freq(j) + 1
Found = True
Exit For
End If
Next j
If Not Found Then
WordNum = WordNum + 1
Words(WordNum) = SingleWord
Freq(WordNum) = 1
End If
If WordNum > maxwords - 1 Then
j = MsgBox("The maximum array size has been exceeded.
Increase maxwords.", vbOKOnly)
Exit For
End If
End If
ttlwds = ttlwds - 1
StatusBar = "Remaining: " & ttlwds & " Unique: " & WordNum
Next aword

' Now sort it into word order
For j = 1 To WordNum - 1
k = j
For l = j + 1 To WordNum
If (Not ByFreq And Words(l) < Words(k)) Or (ByFreq And
Freq(l) > Freq(k)) Then k = l
Next l
If k <> j Then
tword = Words(j)
Words(j) = Words(k)
Words(k) = tword
Temp = Freq(j)
Freq(j) = Freq(k)
Freq(k) = Temp
End If
StatusBar = "Sorting: " & WordNum - j
Next j

' Now write out the results
tmpName = ActiveDocument.AttachedTemplate.FullName
Documents.Add Template:=tmpName, NewTemplate:=False
Selection.ParagraphFormat.TabStops.ClearAll
With Selection
For j = 1 To WordNum
.TypeText Text:=Words(j) & vbTab & Trim(Str(Freq(j))) &
vbCrLf
Next j
End With
ActiveDocument.Range.Select
Selection.ConvertToTable
Selection.Collapse wdCollapseStart
ActiveDocument.Tables(1).Rows.Add BeforeRow:=Selection.Rows(1)
ActiveDocument.Tables(1).Cell(1, 1).Range.InsertBefore "Word"
ActiveDocument.Tables(1).Cell(1, 2).Range.InsertBefore
"Occurrences"
ActiveDocument.Tables(1).Range.ParagraphFormat.Alignment =
wdAlignParagraphCenter
ActiveDocument.Tables(1).Rows.Add
ActiveDocument.Tables(1).Cell(ActiveDocument.Tables(1).Rows.Count,
1).Range.InsertBefore "Total words in Document"
ActiveDocument.Tables(1).Cell(ActiveDocument.Tables(1).Rows.Count,
2).Range.InsertBefore Totalwords
ActiveDocument.Tables(1).Rows.Add
ActiveDocument.Tables(1).Cell(ActiveDocument.Tables(1).Rows.Count,
1).Range.InsertBefore "Number of different words in Document"
ActiveDocument.Tables(1).Cell(ActiveDocument.Tables(1).Rows.Count,
2).Range.InsertBefore Trim(Str(WordNum))
System.Cursor = wdCursorNormal
' j = MsgBox("There were " & Trim(Str(WordNum)) & " different words
", vbOKOnly, "Finished")
Selection.HomeKey wdStory

End Sub



--
Hope this helps.

Please reply to the newsgroup unless you wish to avail yourself of my
services on a paid consulting basis.

Doug Robbins - Word MVP, originally posted via msnews.microsoft.com
 
G

Graham Mayor

Nice one - but like all macros of this type it will inevitably take a while
to run with any document more than a couple of pages long

I think I would also be inclined to change the three lines associated with
' Control the repeat

From

For Each aword In ActiveDocument.Words
SingleWord = Trim(aword)
If SingleWord < "A" Or SingleWord > "z" Then SingleWord = "" 'Out of
range?

To

For Each aword In ActiveDocument.Words
SingleWord = LCase(Trim(aword))
If SingleWord < "a" Or SingleWord > "z" Then SingleWord = "" 'Out of
range?

to avoid duplication where words or parts of words are capitalized.

--
<>>< ><<> ><<> <>>< ><<> <>>< <>><<>
Graham Mayor - Word MVP

My web site www.gmayor.com

<>>< ><<> ><<> <>>< ><<> <>>< <>><<>

The following macro will do what you want:

Sub WordFrequency()

Dim SingleWord As String 'Raw word pulled from doc
Const maxwords = 9000 'Maximum unique words
allowed Dim Words(maxwords) As String 'Array to hold
unique words Dim Freq(maxwords) As Integer 'Frequency
counter for Unique Words
Dim WordNum As Integer 'Number of unique words
Dim ByFreq As Boolean 'Flag for sorting order
Dim ttlwds As Long 'Total words in the
document Dim Excludes As String 'Words to be
excluded Dim Found As Boolean 'Temporary flag
Dim j, k, l, Temp As Integer 'Temporary variables
Dim tword As String '

' Set up excluded words
' Excludes =
"[the][a][of][is][to][for][this][that][by][be][and][are]"
Excludes = ""
Excludes = InputBox$("Enter words that you wish to exclude,
surrounding each word with [ ].", "Excluded Words", "")
' Excludes = Excludes & InputBox$("The following words are
excluded: " & Excludes & ". Enter words that you wish to exclude,
surrounding each 'word with [ ].", "Excluded Words", "")
' Find out how to sort
ByFreq = True
Ans = InputBox$("Sort by WORD or by FREQ?", "Sort order",
"FREQ") If Ans = "" Then End
If UCase(Ans) = "WORD" Then
ByFreq = False
End If

Selection.HomeKey Unit:=wdStory
System.Cursor = wdCursorWait
WordNum = 0
ttlwds = ActiveDocument.Words.Count
Totalwords = ActiveDocument.Words.Count

' Control the repeat
For Each aword In ActiveDocument.Words
SingleWord = Trim(aword)
If SingleWord < "A" Or SingleWord > "z" Then SingleWord =
"" 'Out of range?
If InStr(Excludes, "[" & SingleWord & "]") Then
SingleWord = "" 'On exclude list?
If Len(SingleWord) > 0 Then
Found = False
For j = 1 To WordNum
If Words(j) = SingleWord Then
Freq(j) = Freq(j) + 1
Found = True
Exit For
End If
Next j
If Not Found Then
WordNum = WordNum + 1
Words(WordNum) = SingleWord
Freq(WordNum) = 1
End If
If WordNum > maxwords - 1 Then
j = MsgBox("The maximum array size has been
exceeded. Increase maxwords.", vbOKOnly)
Exit For
End If
End If
ttlwds = ttlwds - 1
StatusBar = "Remaining: " & ttlwds & " Unique: " &
WordNum Next aword

' Now sort it into word order
For j = 1 To WordNum - 1
k = j
For l = j + 1 To WordNum
If (Not ByFreq And Words(l) < Words(k)) Or (ByFreq And
Freq(l) > Freq(k)) Then k = l
Next l
If k <> j Then
tword = Words(j)
Words(j) = Words(k)
Words(k) = tword
Temp = Freq(j)
Freq(j) = Freq(k)
Freq(k) = Temp
End If
StatusBar = "Sorting: " & WordNum - j
Next j

' Now write out the results
tmpName = ActiveDocument.AttachedTemplate.FullName
Documents.Add Template:=tmpName, NewTemplate:=False
Selection.ParagraphFormat.TabStops.ClearAll
With Selection
For j = 1 To WordNum
.TypeText Text:=Words(j) & vbTab & Trim(Str(Freq(j)))
& vbCrLf
Next j
End With
ActiveDocument.Range.Select
Selection.ConvertToTable
Selection.Collapse wdCollapseStart
ActiveDocument.Tables(1).Rows.Add BeforeRow:=Selection.Rows(1)
ActiveDocument.Tables(1).Cell(1, 1).Range.InsertBefore "Word"
ActiveDocument.Tables(1).Cell(1, 2).Range.InsertBefore
"Occurrences"
ActiveDocument.Tables(1).Range.ParagraphFormat.Alignment =
wdAlignParagraphCenter
ActiveDocument.Tables(1).Rows.Add
ActiveDocument.Tables(1).Cell(ActiveDocument.Tables(1).Rows.Count,
1).Range.InsertBefore "Total words in Document"
ActiveDocument.Tables(1).Cell(ActiveDocument.Tables(1).Rows.Count,
2).Range.InsertBefore Totalwords
ActiveDocument.Tables(1).Rows.Add
ActiveDocument.Tables(1).Cell(ActiveDocument.Tables(1).Rows.Count,
1).Range.InsertBefore "Number of different words in Document"
ActiveDocument.Tables(1).Cell(ActiveDocument.Tables(1).Rows.Count,
2).Range.InsertBefore Trim(Str(WordNum)) System.Cursor =
wdCursorNormal ' j = MsgBox("There were " & Trim(Str(WordNum))
& " different words ", vbOKOnly, "Finished")
Selection.HomeKey wdStory

End Sub




Doug Robbins - Word MVP, originally posted via msnews.microsoft.com
O.Berkeley said:
I know that Word 2007 can count the number of words in a document,
but is there a way of adapting that ability to come up with a list
of unique words
found in a file with the number of times each word appears?

Many thanks.
 
O

O.Berkeley

Wow! Thank you so much for both of your suggestions! This is terrific!

I will try this out once I figure out how to run macros. I assume that I use
the Developers Ribbon to do this.

I am very grateful for your help. It is very kind of you.

All the best
 
D

Doug Robbins - Word MVP

See the article "What do I do with macros sent to me by other newsgroup
readers to help me out?†at:
http://www.word.mvps.org/FAQs/MacrosVBA/CreateAMacro.htm


--
Hope this helps.

Please reply to the newsgroup unless you wish to avail yourself of my
services on a paid consulting basis.

Doug Robbins - Word MVP, originally posted via msnews.microsoft.com
 
O

O.Berkeley

Hello!

The macro works really well once I figured out where to cut and paste the
lines of the macro. I'm so impressed with how fast it is.

However, I have a really dumb question for you.

I noticed that the total number of words in the document comes out too high
when compared with the word count built into Word 2007.

As far as I can see the total number of words in the document is defined as:


Totalwords = ActiveDocument.Words.Count


I assume that this will be the same value that the built in function of Word
gives and the only other place this value is used is at the end where I
assume this value is printed out:


ActiveDocument.Tables(1).Cell(ActiveDocument.Tables(1).Rows.Count,
1).Range.InsertBefore "Total words in Document"
ActiveDocument.Tables(1).Cell(ActiveDocument.Tables(1).Rows.Count,
2).Range.InsertBefore Totalwords


Shouldn't this be the same value as the word processor's count? Why is it so
much higher? Obviously, I'm missing something here.

Many thanks. Sorry for the trouble.
 
O

O.Berkeley

Thank you.

Sorry, but I don't think that I made myself clear. I did manage to figure
how to run the macro. It works very well, only the total number of words in
the document (from Word 2007's built in function under the Review ribbon)
does not agree with the macro's printout of the total number of words.

Many thanks.
 
O

O.Berkeley

I think that i may have found my answer about the disparity between the real
word count and ActiveDocument.Words.Count

Apparently the ActiveDocument.Words.Count term includes both the total
number of words and things like the paragraph character.
 
D

Doug Robbins - Word MVP

Right. If you replace the line of code

Totalwords = ActiveDocument.Words.Count

with

Totalwords =
ActiveDocument.BuiltInDocumentProperties(wdPropertyWords)

you will get the same number of words.
--
Hope this helps.

Please reply to the newsgroup unless you wish to avail yourself of my
services on a paid consulting basis.

Doug Robbins - Word MVP, originally posted via msnews.microsoft.com
 
G

Gus

O.Berkeley said:
I know that Word 2007 can count the number of words in a document, but is
there a way of adapting that ability to come up with a list of unique
words
found in a file with the number of times each word appears?

Many thanks.

Put the word in between spaces in Replace and in Replace with use say xxx,
also between spaces, then click Replace All. It will show the number of
replacements.
Then close the Find and Replace window and click the undo button.
Gus
 
G

Graham Mayor

Gus said:
Put the word in between spaces in Replace and in Replace with use say
xxx, also between spaces, then click Replace All. It will show the
number of replacements.
Then close the Find and Replace window and click the undo button.
Gus

Put ^& in the replace with box and you won't need to undo anything.
However it does not address the original question which was to produce a
*list* of unique words with their counts.
--
<>>< ><<> ><<> <>>< ><<> <>>< <>><<>
Graham Mayor - Word MVP

My web site www.gmayor.com

<>>< ><<> ><<> <>>< ><<> <>>< <>><<>
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top