Finding unique words


Guest

Hi all,
Does anyone know if it's possible to make a list of all the unique words in
a document without having to destroy all the punctuation and formatting
first? I know you can make a concordance index, but you have to know all the
words first for that. I'm an amateur Java programmer, so if you know Java,
you know that we can use StringTokenizers and HashSets to do this for small
strings. Is there a way to do that on a larger scale for a Word file that's
a few hundred pages long? (I know it's a different programming language;
the Java was just an example.)
Thanks!
 
What's wrong with making a copy then destroying all the punctuation?
Quickest method I know is to use Find and Replace to delete all non-text and
convert all white space to paragraph marks; then copy to Excel and do a
unique filter.

If you want to do it with VBA, iterate the words collection, check whether
the 'word' is text, and if so, add it to a collection using it as both the
key and the item. Since keys must be unique, you end up with a unique list.
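
The keyed-collection idea is not specific to VBA. As a rough sketch in Python (not Word VBA; the function name `unique_words` and the sample text are made up for illustration, and `str.isalpha()` stands in for the "is it text" check), a dict can play the role of a collection whose keys must be unique:

```python
def unique_words(tokens):
    """Return the distinct alphabetic tokens, in first-seen order."""
    seen = {}  # stands in for the keyed collection: one entry per key
    for token in tokens:
        # Skip anything that is not purely letters (punctuation, numbers).
        if token.isalpha():
            # The word is both key and item; a repeat just overwrites itself.
            seen[token] = token
    return list(seen)


sample = "the cat , the dog . The end".split()
print(unique_words(sample))
```

This sketch is case-sensitive, so "the" and "The" count as two words; lowercasing each token before storing it would merge them.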
 
Hi Jezebel,
There isn't anything wrong with making a copy and destroying all the
punctuation; I've done that before and it works well. I was just hoping
there was a faster way, because it takes quite a while to go through all the
possible punctuation marks one by one. Is there a way to quickly replace
anything that isn't text with a paragraph break?
 
Use Edit | Replace and replace full stop with paragraph break. Same again for
comma and any other punctuation mark in the document.
 
Thanks Jezebel, that works really well, but I notice it destroys hyphens and
apostrophes too. Is there a way to do this while keeping the hyphens and
apostrophes? And just so I know for later, what does the ^013 mean?
Thanks!

Jezebel said:
With 'Use wildcards' checked --

Find: [!a-zA-Z]
Replace: ^013




 
The find text is a wildcard pattern, which works much like a regular
expression. Inside the square brackets, the exclamation mark means 'not' --
so the expression means 'match any character other than a-z, upper or lower
case'. You can add any other characters you also want to exclude from the
match (that is, keep in the text), e.g. [!A-Za-z,-]. You have to put the
hyphen last, otherwise it's interpreted as a range indicator.

The caret means that the following digits are a decimal character number.
013 = paragraph mark.
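
For anyone who wants to try the same replacement outside Word, here is a small Python sketch. It illustrates the idea and is not Word's wildcard engine: Python's `re` module writes the negation as `^` inside the brackets rather than `!`, and `chr(13)` is the character with decimal code 13. The function name `one_word_per_line` is hypothetical.

```python
import re


def one_word_per_line(text):
    """Replace every character that is not a-z or A-Z with a carriage return."""
    # [^a-zA-Z] is the regex counterpart of the Word wildcard [!a-zA-Z].
    return re.sub(r"[^a-zA-Z]", chr(13), text)


converted = one_word_per_line("Hello, world!")
print(repr(converted))
```

Each non-letter becomes its own carriage return, so runs of punctuation and spaces leave empty lines behind, just as the Word replacement leaves empty paragraphs.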




 
A good reference for wildcards in Find and Replace is at
http://www.gmayor.com/replace_using_wildcards.htm.

The ^013 is the code for a paragraph mark (technically, the ASCII
character with the numeric value 13, which is a carriage return in
plain text).

The code ^p could also be used for a paragraph mark, but only in the
Replace With box (for some reason only ^013 works in the Find What
box when 'Use wildcards' is checked). In fact, ^p is the better choice
there: if you use ^013 in the Replace With box, the Table > Sort
command in Word won't recognize the "paragraph marks" and will claim
there are no valid records (paragraphs) in the text to be sorted.
They'll work OK when you copy the text into Excel, though.

To make the Replace leave apostrophes and hyphens in place, use the
search expression

[!a-zA-Z'-]

This expression translates to "find all characters that are not in the
ranges a through z or A through Z, and are not an apostrophe or a
hyphen".
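
To connect this back to the original question, here is a minimal Python sketch (an illustration outside Word, with a hypothetical function name and sample text). It uses the regex counterpart of the expression above, `[^a-zA-Z'-]`, to split plain text and collect the unique words while leaving the source text itself untouched:

```python
import re


def unique_words_keeping_marks(text):
    """Split on anything that is not a letter, apostrophe, or hyphen."""
    # The hyphen is last in the class so it is taken literally, not as a range.
    pieces = re.split(r"[^a-zA-Z'-]+", text)
    # Drop the empty strings left at the edges, then de-duplicate and sort.
    return sorted({piece for piece in pieces if piece})


words = unique_words_keeping_marks("Don't stop. The well-known cat; don't stop!")
print(words)
```

A hyphen or apostrophe standing on its own (for example a dash with spaces around it) would survive as a "word" here, so the result may still need a quick check by eye.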

--
Regards,
Jay Freedman
Microsoft Word MVP
Email cannot be acknowledged; please post all follow-ups to the
newsgroup so all may benefit.

 
Thank you Jezebel and Jay, this was really helpful.
jezzica85

 
