Comparing strings (with typos)

Ronald

Hi.

I'm very much in need of a routine that compares two strings (addresses).
One string has the correct address; the other has the same address but may
contain typos.
You can't just use Compare.
How can you get a probability (percentage) that the second string equals the
first string?

Thanks in advance,

Ronald.
 
Tom van Stiphout

You need a fuzzy-string comparison routine. One of my favorite topics.
One option is to use the Ratcliff/Obershelp algorithm, which is in the
public domain. I'm writing an article about this, but it's not yet
ready for publication, so you'll have to do some work yourself.

-Tom.
Microsoft Access MVP
 
kc-mass

Why not simply:

1. Replace every string2 with string1; then both are equal and correct

or

2. If you just want to know which ones do not match:
tfResult = (string2 = string1)

I don't think probability and equality mesh well.
 
kc-mass

Hi John

I am stumped here. If I compare two fixed strings can I say that there is
an 80% or 50% chance that they are equal or can I only say that they are or
are not equal?

I am probably showing my ignorance here but even that is worth knowing.

Thx

Kevin
 
John W. Vinson

kc-mass wrote:
> Hi John
>
> I am stumped here. If I compare two fixed strings can I say that there is
> an 80% or 50% chance that they are equal or can I only say that they are or
> are not equal?

Well, two strings are either identical or they're not: "3115 Main St." and
"3115 Main Street" are clearly, unambiguously, absolutely not identical. Ask
any computer <g>...

In practice you do need "fuzzy" matches. Is "Joe Smith" the same person as
"Joseph Smith"? Maybe. Are "315 W. Main St." and "315 West Main St." the same
address? Probably.

Unfortunately no automated solution is going to be as good as a USB interface
(Using Someone's Brain), but the software that John and others have suggested
can make the chore a bit easier by doing some prefiltering.
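
As a rough illustration of that kind of prefiltering, here is a minimal Python sketch that sorts address pairs into "accept", "review" (hand to a person) and "reject". It uses the standard library's difflib.SequenceMatcher, which the Python documentation describes as based on the Ratcliff/Obershelp approach. The two cut-off values and the function names are assumptions made up for this example, not recommendations; they would need tuning against real data.

```python
from difflib import SequenceMatcher

# Hypothetical cut-offs, chosen only for this illustration.
ACCEPT_AT = 0.95
REVIEW_AT = 0.75


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def triage(reference: str, candidate: str) -> str:
    """Sort a pair into 'accept', 'review' (ask a human) or 'reject'."""
    score = similarity(reference, candidate)
    if score >= ACCEPT_AT:
        return "accept"
    if score >= REVIEW_AT:
        return "review"
    return "reject"


pairs = [
    ("315 W. Main St.", "315 W. Main St."),
    ("315 W. Main St.", "315 West Main St."),
    ("315 W. Main St.", "PO Box 42"),
]
for reference, candidate in pairs:
    print(f"{triage(reference, candidate):7} {reference!r} vs {candidate!r}")
```

Only the middle bucket needs a person to look at it, which is where the time saving comes from.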
 
John Spencer MVP

The code that I referred you to will return a number representing the degree
to which the two strings are alike. For instance,

Simil("John Spencer","Jon Spenser") returns 0.869565217391304
Simil("John Spencer","Joxn Spenser") returns 0.833333333333333
Simil("John Spencer","John Spencer") returns 1
Simil("John Spencer","John Spence") returns 0.956521739130435
Simil("John Spencer","John Spencre") returns 0.916666666666667
Simil("John smith","John Spencee") returns 0.545454545454545
Simil("cc smith","John Spencee") returns 0.2
Simil("BarbaraAnne","xxxxxxxxx") returns 0

I haven't looked at the code enough to say how accurate its ratings are and
exactly how the algorithm works. Google Ratcliff Obershelp and read up on
what this is supposed to do.

John Spencer
Access MVP 2002-2005, 2007-2009
The Hilltop Institute
University of Maryland Baltimore County
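
For readers who would rather see the idea than look it up, here is a minimal Python sketch of Ratcliff/Obershelp as it is usually described: find the longest common substring, repeat on the pieces to its left and right, then return twice the number of matched characters divided by the combined length of both strings. This is not the Simil code referred to above, and the function names are made up for illustration. Lower-casing both inputs is an assumption; it is what makes "John smith" vs "John Spencee" come out at about 0.545 rather than 0.455.

```python
def longest_common_substring(a: str, b: str) -> tuple[int, int, int]:
    """Return (start_in_a, start_in_b, length) of the first longest match."""
    best = (0, 0, 0)
    # prev[j] holds the length of the common run ending at a[i-2], b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def matching_chars(a: str, b: str) -> int:
    """Count matched characters: longest match plus matches on either side."""
    start_a, start_b, length = longest_common_substring(a, b)
    if length == 0:
        return 0
    left = matching_chars(a[:start_a], b[:start_b])
    right = matching_chars(a[start_a + length:], b[start_b + length:])
    return length + left + right


def similarity(s1: str, s2: str) -> float:
    """Ratcliff/Obershelp-style score in [0, 1]; case is ignored (assumption)."""
    if not s1 and not s2:
        return 1.0
    a, b = s1.lower(), s2.lower()
    return 2.0 * matching_chars(a, b) / (len(a) + len(b))


print(similarity("John Spencer", "Jon Spenser"))   # 20 / 23, about 0.8696
print(similarity("John smith", "John Spencee"))    # 12 / 22, about 0.5455
print(similarity("BarbaraAnne", "xxxxxxxxx"))      # 0.0
```

Worked by hand for the first pair: the longest shared run is "n Spen" (6 characters), "Jo" matches on the left (2) and "er" on the right (2), so the score is 2 * 10 / (12 + 11) = 20 / 23.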
 
kc-mass

I have my mouth open so I might as well place the foot there -

When the algorithm returns a percentage other than 100% (expressed as "1"),
isn't it just showing the degree of match, or the share of non-erroneous
characters in the pattern, and not the "probability" of "equality"? Isn't the
probability of equality in any of these instances (save the 1) 0%?

Thx

Kevin
 
John W. Vinson

kc-mass wrote:
> I have my mouth open so I might as well place the foot there -
>
> When the algorithm returns a percentage other than 100% (expressed as "1"),
> isn't it just showing the degree of match, or the share of non-erroneous
> characters in the pattern, and not the "probability" of "equality"? Isn't the
> probability of equality in any of these instances (save the 1) 0%?

I'd describe it as "the level of similarity" rather than "the probability of
equality".
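
A small Python sketch may make the distinction concrete. It uses the standard library's difflib.SequenceMatcher as a stand-in for a Ratcliff/Obershelp-style routine (an assumption; it is not the code discussed in this thread), and the addresses are invented. A mistyped copy of an address and a genuinely different house on the same street can get the same score, so the number says how alike the text is, not how likely the two records are to mean the same place.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Text similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


correct = "12 Main St"
typo = "12 Mian St"        # same address, two letters swapped
neighbour = "13 Main St"   # a different house altogether

print(similarity(correct, typo))       # 0.9
print(similarity(correct, neighbour))  # 0.9
```

Both pairs share 9 of their 10 characters in matching runs, giving 2 * 9 / 20 = 0.9 each time.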
 