Finding unique words

Guest · Apr 22, 2006

Hi all,
Does anyone know if it's possible to make a list of all the unique words in
a document without having to destroy all the punctuation and formatting
first? I know you can make a concordance index, but you have to know all the
words first for that. I'm an amateur Java programmer, and in Java you can do
this for small strings with a StringTokenizer and a HashSet. Is there a way
to do the same on a larger scale for a Word file that's a few hundred pages
long? (I know Word uses a different programming language; the Java was just
an example.)
Thanks!

Jezebel · Apr 22, 2006

What's wrong with making a copy then destroying all the punctuation?
Quickest method I know is to use Find and Replace to delete all non-text and
convert all white space to paragraph marks; then copy to Excel and do a
unique filter.

If you want to do it with VBA, iterate the words collection, check whether
the 'word' is text, and if so, add it to a collection using it as both the
key and the item. Since keys must be unique, you end up with a unique list.
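
For readers more at home outside VBA, the keyed-collection idea can be sketched in Python, where a dict plays the role of the collection. This is only an illustration, not Word code: the function name `unique_words` and the sample list are made up, and lowercasing the key is an assumption about how you want case treated.

```python
def unique_words(words):
    """Return the distinct alphabetic words in first-seen order."""
    seen = {}  # key: lowercased word, value: the spelling first encountered
    for word in words:
        if not word.isalpha():
            continue  # skip punctuation, numbers and other non-text 'words'
        key = word.lower()
        if key not in seen:
            seen[key] = word  # each key is stored once, so duplicates drop out
    return list(seen.values())


# Hypothetical input standing in for a document's word list.
sample = ["The", "cat", ",", "the", "dog", ".", "Cat", "42"]
print(unique_words(sample))
```

Because the key is checked before anything is stored, the result keeps the first spelling of each word and ignores later repeats.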

Guest · Apr 22, 2006

Hi Jezebel,
There isn't anything wrong with making a copy and destroying all the
punctuation; I've done that before and it works well. I was just hoping
there was a faster way, because it takes quite a while to go through all the
possible punctuation marks one by one. Is there a way to quickly replace
anything that isn't text with a paragraph break?

Guest · Apr 22, 2006

Use Edit | Replace and replace full stop with paragraph break. Same again for
comma and any other punctuation mark in the document.

Jezebel · Apr 22, 2006

With 'Use wildcards' checked --

Find: [!a-zA-Z]
Replace: ^013
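
The same replacement can be tried outside Word. Below is a rough Python equivalent, assuming the document text has already been extracted to a plain string. Python's `re` module writes the negation as `^` inside the brackets where Word's wildcard syntax uses `!`. The function name and sample text are hypothetical.

```python
import re


def one_word_per_line(text):
    """Replace every run of non-letters with a single carriage return."""
    # [^a-zA-Z] is the regex counterpart of Word's [!a-zA-Z].
    # The + collapses a run of separators so no empty lines appear.
    return re.sub(r"[^a-zA-Z]+", "\r", text).strip("\r")


# Hypothetical sample text.
sample = "Hello, world! Hello again."
print(one_word_per_line(sample).split("\r"))
```

Note one deliberate difference: the `+` here merges consecutive separators, whereas the Word replacement above swaps each character individually and so can leave empty paragraphs behind.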

Guest · Apr 23, 2006

Thanks Jezebel, that works really well, but I notice it destroys hyphens and
apostrophes too. Is there a way to do this while keeping the hyphens and
apostrophes? And just so I know for later, what does the ^013 mean?
Thanks!

Jezebel · Apr 23, 2006

The find text is a wildcard pattern, Word's version of a regular expression.
The exclamation mark means 'not', so the expression means 'match any
character other than a-z, upper or lower case'. You can add any other
characters you also want left alone, eg [!A-Za-z,-]. You have to put the
hyphen last, otherwise it's interpreted as a range indicator.

The caret means that the following digits are a decimal character number.
013 = paragraph mark.
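
To see the character-number idea outside Word: in Python, `chr(13)` gives the carriage return that the code 013 refers to. A tiny sketch, with nothing Word-specific assumed and made-up sample words:

```python
# Character number 13 is the carriage return, written "\r" in Python.
carriage_return = chr(13)

print(carriage_return == "\r")
print(ord("\r"))

# Joining words with it yields one word per "paragraph".
joined = carriage_return.join(["alpha", "beta"])
print(repr(joined))
```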

Jay Freedman · Apr 23, 2006

A good reference for wildcards in Find and Replace is at
http://www.gmayor.com/replace_using_wildcards.htm.

The ^013 is the code for a paragraph mark (technically, the ASCII
character with the numeric value 13, which is a carriage return in
plain text).

The code ^p can also be used for a paragraph mark, but in a wildcard
search it works only in the Replace With box (for some reason only
^013 works in the Find What box). In fact, ^p is the better choice
there: if you use ^013 in the Replace With box, the Table > Sort
command in Word won't recognize the "paragraph marks" and will claim
there are no valid records (paragraphs) in the text to be sorted.
They'll work OK when you copy the text into Excel, though.

To make the Replace leave apostrophes and hyphens in place, use the
search expression

[!a-zA-Z'-]

This expression translates to "find all characters that are not in the
ranges a through z or A through Z, and are not an apostrophe or a
hyphen".
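
As a rough illustration of the same character class outside Word, here is a Python sketch that splits on everything except letters, apostrophes and hyphens, then reduces the result to a sorted unique list. It assumes the text is already available as a plain string; the function name, the sample sentence and the choice to lowercase are all assumptions made for the example.

```python
import re


def unique_words_keeping_marks(text):
    """Split on anything but letters, apostrophes and hyphens; return sorted unique words."""
    # The hyphen sits last in the class so it is read literally, not as a range.
    tokens = re.split(r"[^a-zA-Z'-]+", text)
    # A set discards duplicates; lowercase first so "Don't" and "don't" merge.
    return sorted({t.lower() for t in tokens if t})


# Hypothetical sample text.
sample = "Don't stop. The well-known cat; don't STOP!"
print(unique_words_keeping_marks(sample))
```

Two caveats for this sketch: a hyphen standing on its own (as in "a - b") survives as a one-character "word", and a curly apostrophe is a different character from the straight one in the class, so it is still treated as a separator.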

--
Regards,
Jay Freedman
Microsoft Word MVP
Email cannot be acknowledged; please post all follow-ups to the
newsgroup so all may benefit.

Guest · Apr 23, 2006

Thank you Jezebel and Jay, this was really helpful.
jezzica85
