text compare, percentage of similar words

B

bodhi2.71828

I'm trying to compare text, but the text is not always exactly the
same. Sometimes words are rearranged, changed, missing. So I'd like
to create a function that returns the percentage of words that are
common between the two strings. Any suggestions? Also, is there a
better way to compare strings that are not quite equal? Thanks.
 
R

Ron Rosenfeld

I'm trying to compare text, but the text is not always exactly the
same. Sometimes words are rearranged, changed, missing. So I'd like
to create a function that returns the percentage of words that are
common between the two strings. Any suggestions? Also, is there a
better way to compare strings that are not quite equal? Thanks.

Given these two sentences:

Who is the best character in the play?
Who is the worst play character?

what would you expect for a result?

Note that "the" appears twice in the first line, but only once in the second.
But the situation could be reversed for other words.

So there are six words in the first string that can be found in the second
string.

And there are 5 words in the second string that can be found in the first.
--ron
 
B

bodhi2.71828

Good question. I probably should buy a book or take a class on natural
language processing. But until I can do that, here is what I have so
far. I added a penalty for strings that have many of the same words
but are significantly different in length.

Private Function RateLabel(str1 As String, str2 As String) As Integer
Dim i As Integer, j As Integer, intMatch As Integer, intRating As
Integer, str1Array, str2Array, strTemp
If str1 <> "" And str2 <> "" Then
If StrComp(str1, str2, vbTextCompare) = 0 Then
RateLabel = 100
Else
str1 = Application.WorksheetFunction.Trim(str1)
str2 = Application.WorksheetFunction.Trim(str2)
str1Array = Split(str1, " ")
str2Array = Split(str2, " ")
'make sure str1Array is always the smaller of the two
If UBound(str1Array) > UBound(str2Array) Then
strTemp = str1Array
str1Array = str2Array
str2Array = strTemp
End If
'count words and determine % that match
For i = 0 To UBound(str1Array)
For j = 0 To UBound(str2Array)
If StrComp(str1Array(i), str2Array(j),
vbTextCompare) = 0 Then
intMatch = intMatch + 1
Exit For
End If
Next j
Next i
intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8
* (UBound(str1Array) - UBound(str2Array)))
If intRating > 0 Then RateLabel = intRating Else RateLabel
= 0
End If
End If
MsgBox "str1 = " & str1 & ", str2 = " & str2 & ", rating = " &
RateLabel
End Function
 
H

Helmut Weber

Hi,

google for Levenshtein distance.

There are other algorithms,
which may return a correlation coefficient
that matches better the intuition of an ordinary language user,
one of these algorithms by me,
but far too complicated and far too slow. :-(

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"
 
R

Ron Rosenfeld

Good question. I probably should buy a book or take a class on natural
language processing. But until I can do that, here is what I have so
far. I added a penalty for strings that have many of the same words
but are significantly different in length.

Well, I'm still not sure what you are getting at.

Using your formula with my test strings, I get a result of "58". I don't
understand the logic of that.

Replacing my 2nd string with a word that does not appear in the first, I get a
result of '0'. That makes sense.

But if I replace the second string with a single word that appears in the
first, I get a rating of 156. Again, I don't understand that.

--ron
 
B

bodhi2.71828

For your original example, 50% of the words match (3 of 6). It doesn't
catch "play" and "character" because they both appear once with a
question mark attached and once without. So I need to improve how it
handles punctuations.

Anyway, 50 would be the value (50% of the words match), but since the
strings are different lengths, it discounts the score via the line:

intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 *
(UBound(str1Array) - UBound(str2Array)))

Although I made a mistake in this line, which is why you get 58 instead
of 42. It should read:

intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 *
(UBound(str2Array) - UBound(str1Array)))

Thanks for pointing that out.

To give you some background, I have several tables of data I'm trying
to merge, but the labels are sometimes different even though it's
really the same line. The data itself is similar: close, but not
always the same. Since there are thousands of tables, I need a way for
the computer to decide whether the lines in two tables really refers to
the same thing. So my approach has been to rate how close the line in
one table is to the line in another (both label text and table data),
so that they can be merged without having to do it manually.
 
B

bodhi2.71828

Thanks Helmut. That is a great start. I need to find something a
little more advanced than Levenshtein, so that "What are you?" and "You
are what?" get a higher score. But a great start, thanks. If you have
any more recommendations on search/language processing, I would love to
look into them.
 
R

Ron Rosenfeld

For your original example, 50% of the words match (3 of 6). It doesn't
catch "play" and "character" because they both appear once with a
question mark attached and once without. So I need to improve how it
handles punctuations.

Anyway, 50 would be the value (50% of the words match), but since the
strings are different lengths, it discounts the score via the line:

intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 *
(UBound(str1Array) - UBound(str2Array)))

Although I made a mistake in this line, which is why you get 58 instead
of 42. It should read:

intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 *
(UBound(str2Array) - UBound(str1Array)))

Thanks for pointing that out.

To give you some background, I have several tables of data I'm trying
to merge, but the labels are sometimes different even though it's
really the same line. The data itself is similar: close, but not
always the same. Since there are thousands of tables, I need a way for
the computer to decide whether the lines in two tables really refers to
the same thing. So my approach has been to rate how close the line in
one table is to the line in another (both label text and table data),
so that they can be merged without having to do it manually.

OK, well here is a routine that tells you how many of the words in the shorter
of two strings will match a word in the longer of the two strings.

It ignores punctuation and extra spaces; and it is also written to be
case-insensitive (although that is easily changed, if you wish).

It uses "regular expressions" to do the parsing and counting and comparing.

Because I am more familiar with it, I have used the regular expression
implementation available in Longre's free morefunc.xll add-in, available at
http://xcell05.free.fr/.

You could set a reference to Microsoft VBScript Regular Expressions 5.5 in your
VBE, but I'm not as familiar with its usage.

I output RateLabel as a decimal so I have set its type to Double.

In my original set of strings, where, ignoring case and punctuation, there were
5/6 matches in the second string, the output is 0.833333333

I leave it to you to make the adjustments you wish for text strings of
different lengths, if you choose to use this method.

======================================
Option Explicit

Function RateLabel(str1 As String, str2 As String) As Double
'fraction of words in shorter string that are also found in longer string

Dim l1 As Long, l2 As Long
Dim t As Long, v As Long
Dim a() As Long
Dim x, y

'count the number of words in each string
l1 = Run([regex.count], str1, "\w+")
l2 = Run([regex.count], str2, "\w+")


ReDim a(1 To IIf(l1 < l2, l1, l2))

For t = 1 To UBound(a)
a(t) = t
Next t

'comparison done with words (no punctuation) and case-insensitive

If l2 < l1 Then
x = Run([regex.mid], str2, "\w+", a)
y = Run([regex.find], str1, x, , False)
Else
x = Run([regex.mid], str1, "\w+", a)
y = Run([regex.find], str2, x, , False)
End If

If IsArray(y) Then
For t = 1 To UBound(y)
If y(t) > 0 Then v = v + 1
Next t
RateLabel = v / UBound(y)
Else
'if shorter string only a single word, then no array is formed
RateLabel = IIf(y > 0, 1, 0)
End If
MsgBox "str1 = " & str1 & ", str2 = " & str2 & ", rating = " & RateLabel
End Function
===============================


--ron
 
H

Helmut Weber

Googling for

string similarity

returns thousands of hits.

--
Greetings from Bavaria, Germany

Helmut Weber, MVP WordVBA

Win XP, Office 2003
"red.sys" & Chr$(64) & "t-online.de"
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Top