For your original example, 50% of the words match (3 of 6). It doesn't
catch "play" and "character" because they both appear once with a
question mark attached and once without. So I need to improve how it
handles punctuations.
Anyway, 50 would be the value (50% of the words match), but since the
strings are different lengths, it discounts the score via the line:
intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 *
(UBound(str1Array) - UBound(str2Array)))
Although I made a mistake in this line, which is why you get 58 instead
of 42. It should read:
intRating = (intMatch / (UBound(str1Array) + 1)) * (100 - 8 *
(UBound(str2Array) - UBound(str1Array)))
Thanks for pointing that out.
To give you some background, I have several tables of data I'm trying
to merge, but the labels are sometimes different even though it's
really the same line. The data itself is similar: close, but not
always the same. Since there are thousands of tables, I need a way for
the computer to decide whether the lines in two tables really refers to
the same thing. So my approach has been to rate how close the line in
one table is to the line in another (both label text and table data),
so that they can be merged without having to do it manually.
OK, well here is a routine that tells you how many of the words in the shorter
of two strings will match a word in the longer of the two strings.
It ignores punctuation and extra spaces; and it is also written to be
case-insensitive (although that is easily changed, if you wish).
It uses "regular expressions" to do the parsing and counting and comparing.
Because I am more familiar with it, I have used the regular expression
implementation available in Longre's free morefunc.xll add-in, available at
http://xcell05.free.fr/.
You could set a reference to Microsoft VBScript Regular Expressions 5.5 in your
VBE, but I'm not as familiar with its usage.
I output RateLabel as a decimal so I have set its type to Double.
In my original set of strings, where, ignoring case and punctuation, there were
5/6 matches in the second string, the output is 0.833333333
I leave it to you to make the adjustments you wish for text strings of
different lengths, if you choose to use this method.
======================================
Option Explicit
Function RateLabel(str1 As String, str2 As String) As Double
'fraction of words in shorter string that are also found in longer string
Dim l1 As Long, l2 As Long
Dim t As Long, v As Long
Dim a() As Long
Dim x, y
'count the number of words in each string
l1 = Run([regex.count], str1, "\w+")
l2 = Run([regex.count], str2, "\w+")
ReDim a(1 To IIf(l1 < l2, l1, l2))
For t = 1 To UBound(a)
a(t) = t
Next t
'comparison done with words (no punctuation) and case-insensitive
If l2 < l1 Then
x = Run([regex.mid], str2, "\w+", a)
y = Run([regex.find], str1, x, , False)
Else
x = Run([regex.mid], str1, "\w+", a)
y = Run([regex.find], str2, x, , False)
End If
If IsArray(y) Then
For t = 1 To UBound(y)
If y(t) > 0 Then v = v + 1
Next t
RateLabel = v / UBound(y)
Else
'if shorter string only a single word, then no array is formed
RateLabel = IIf(y > 0, 1, 0)
End If
MsgBox "str1 = " & str1 & ", str2 = " & str2 & ", rating = " & RateLabel
End Function
===============================
--ron