text compare, percentage of similar words

bodhi2.71828 · Jan 28, 2006

I'm trying to compare text, but the text is not always exactly the
same. Sometimes words are rearranged, changed, missing. So I'd like
to create a function that returns the percentage of words that are
common between the two strings. Any suggestions? Also, is there a
better way to compare strings that are not quite equal? Thanks.

Ron Rosenfeld · Jan 28, 2006

I'm trying to compare text, but the text is not always exactly the
same. Sometimes words are rearranged, changed, missing. So I'd like
to create a function that returns the percentage of words that are
common between the two strings. Any suggestions? Also, is there a
better way to compare strings that are not quite equal? Thanks.

Given these two sentences:

Who is the best character in the play?
Who is the worst play character?

what would you expect for a result?

Note that "the" appears twice in the first line, but only once in the second.
But the situation could be reversed for other words.

So there are six words in the first string that can be found in the second
string.

And there are 5 words in the second string that can be found in the first.
--ron

bodhi2.71828 · Jan 28, 2006

Good question. I probably should buy a book or take a class on natural
language processing. But until I can do that, here is what I have so
far. I added a penalty for strings that have many of the same words
but are significantly different in length.

Private Function RateLabel(str1 As String, str2 As String) As Integer
Dim i As Integer, j As Integer, intMatch As Integer, intRating As
Integer, str1Array, str2Array, strTemp
If str1 <> "" And str2 <> "" Then
If StrComp(str1, str2, vbTextCompare) = 0 Then
RateLabel = 100
Else
str1 = Application.WorksheetFunction.Trim(str1)
str2 = Application.WorksheetFunction.Trim(str2)
str1Array = Split(str1, " ")
str2Array = Split(str2, " ")
'make sure str1Array is always the smaller of the two
If UBound(str1Array) > UBound(str2Array) Then
strTemp = str1Array
str1Array = str2Array
str2Array = strTemp
End If
'count words and determine % that match
For i = 0 To UBound(str1Array)
For j = 0 To UBound(str2Array)
If StrComp(str1Array(i), str2Array(j),
vbTextCompare) = 0 Then
intMatch = intMatch + 1
Exit For
End If
Next j
Next i
intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8
* (UBound(str1Array) - UBound(str2Array)))
If intRating > 0 Then RateLabel = intRating Else RateLabel
= 0
End If
End If
MsgBox "str1 = " & str1 & ", str2 = " & str2 & ", rating = " &
RateLabel
End Function

Helmut Weber · Jan 28, 2006

Hi,

google for Levenshtein distance.

There are other algorithms,
which may return a correlation coefficient
that matches better the intuition of an ordinary language user,
one of these algorithms by me,
but far too complicated and far too slow. :-(

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"

Ron Rosenfeld · Jan 28, 2006

Good question. I probably should buy a book or take a class on natural
language processing. But until I can do that, here is what I have so
far. I added a penalty for strings that have many of the same words
but are significantly different in length.

Well, I'm still not sure what you are getting at.

Using your formula with my test strings, I get a result of "58". I don't
understand the logic of that.

Replacing my 2nd string with a word that does not appear in the first, I get a
result of '0'. That makes sense.

But if I replace the second string with a single word that appears in the
first, I get a rating of 156. Again, I don't understand that.

--ron

bodhi2.71828 · Jan 28, 2006

For your original example, 50% of the words match (3 of 6). It doesn't
catch "play" and "character" because they both appear once with a
question mark attached and once without. So I need to improve how it
handles punctuations.

Anyway, 50 would be the value (50% of the words match), but since the
strings are different lengths, it discounts the score via the line:

intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 *
(UBound(str1Array) - UBound(str2Array)))

Although I made a mistake in this line, which is why you get 58 instead
of 42. It should read:

intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 *
(UBound(str2Array) - UBound(str1Array)))

Thanks for pointing that out.

To give you some background, I have several tables of data I'm trying
to merge, but the labels are sometimes different even though it's
really the same line. The data itself is similar: close, but not
always the same. Since there are thousands of tables, I need a way for
the computer to decide whether the lines in two tables really refers to
the same thing. So my approach has been to rate how close the line in
one table is to the line in another (both label text and table data),
so that they can be merged without having to do it manually.

bodhi2.71828 · Jan 28, 2006

Thanks Helmut. That is a great start. I need to find something a
little more advanced than Levenshtein, so that "What are you?" and "You
are what?" get a higher score. But a great start, thanks. If you have
any more recommendations on search/language processing, I would love to
look into them.

Ron Rosenfeld · Jan 29, 2006

For your original example, 50% of the words match (3 of 6). It doesn't
catch "play" and "character" because they both appear once with a
question mark attached and once without. So I need to improve how it
handles punctuations.

Anyway, 50 would be the value (50% of the words match), but since the
strings are different lengths, it discounts the score via the line:

intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 *
(UBound(str1Array) - UBound(str2Array)))

Although I made a mistake in this line, which is why you get 58 instead
of 42. It should read:

intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 *
(UBound(str2Array) - UBound(str1Array)))

Thanks for pointing that out.

To give you some background, I have several tables of data I'm trying
to merge, but the labels are sometimes different even though it's
really the same line. The data itself is similar: close, but not
always the same. Since there are thousands of tables, I need a way for
the computer to decide whether the lines in two tables really refers to
the same thing. So my approach has been to rate how close the line in
one table is to the line in another (both label text and table data),
so that they can be merged without having to do it manually.

OK, well here is a routine that tells you how many of the words in the shorter
of two strings will match a word in the longer of the two strings.

It ignores punctuation and extra spaces; and it is also written to be
case-insensitive (although that is easily changed, if you wish).

It uses "regular expressions" to do the parsing and counting and comparing.

Because I am more familiar with it, I have used the regular expression
implementation available in Longre's free morefunc.xll add-in, available at
http://xcell05.free.fr/.

You could set a reference to Microsoft VBScript Regular Expressions 5.5 in your
VBE, but I'm not as familiar with its usage.

I output RateLabel as a decimal so I have set its type to Double.

In my original set of strings, where, ignoring case and punctuation, there were
5/6 matches in the second string, the output is 0.833333333

I leave it to you to make the adjustments you wish for text strings of
different lengths, if you choose to use this method.

======================================
Option Explicit

Function RateLabel(str1 As String, str2 As String) As Double
'fraction of words in shorter string that are also found in longer string

Dim l1 As Long, l2 As Long
Dim t As Long, v As Long
Dim a() As Long
Dim x, y

'count the number of words in each string
l1 = Run([regex.count], str1, "\w+")
l2 = Run([regex.count], str2, "\w+")

ReDim a(1 To IIf(l1 < l2, l1, l2))

For t = 1 To UBound(a)
a(t) = t
Next t

'comparison done with words (no punctuation) and case-insensitive

If l2 < l1 Then
x = Run([regex.mid], str2, "\w+", a)
y = Run([regex.find], str1, x, , False)
Else
x = Run([regex.mid], str1, "\w+", a)
y = Run([regex.find], str2, x, , False)
End If

If IsArray(y) Then
For t = 1 To UBound(y)
If y(t) > 0 Then v = v + 1
Next t
RateLabel = v / UBound(y)
Else
'if shorter string only a single word, then no array is formed
RateLabel = IIf(y > 0, 1, 0)
End If
MsgBox "str1 = " & str1 & ", str2 = " & str2 & ", rating = " & RateLabel
End Function
===============================

--ron

Helmut Weber · Jan 29, 2006

Googling for

string similarity

returns thousands of hits.

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"

compare columns	2	Jul 26, 2006
Comparing Text Contents of Cells	4	Dec 16, 2008
Similar Function in VBA - does it exist ?	6	Jul 22, 2009
String Compare... Tough one?	7	Nov 15, 2007
Compare three strings?	6	Feb 5, 2007
How to program this?	4	Jun 25, 2009
Compare Two Strings	3	Aug 12, 2006
compare contents of two folders	8	Apr 28, 2009

text compare, percentage of similar words

bodhi2.71828

Ron Rosenfeld

bodhi2.71828

Helmut Weber

Ron Rosenfeld

bodhi2.71828

bodhi2.71828

Ron Rosenfeld

Helmut Weber

Ask a Question

Similar Threads