Fuzzy string comparison / detecting "similar" strings

xirx · Feb 15, 2005

When dealing with real-life data, you often find minor
variations and errors in your data. E.g. I have two lists
(databases) in which names differ slightly.

Examples:

"Clark Kent" vs "Clark Kent"
"John P. Smith" vs "John Paul Smith"
"Miller Limited" vs "Miller Ltd."
"Peter Hammer" vs "Petre Hammer"

I am looking for a way to handle this (semi-)automatically.
My idea is to have a function f that takes two strings
and returns a measure of how much they are alike. E.g.
f should be 1 if both arguments are identical, and it
should be 0 if they are "completely" different.
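
For illustration only, here is a minimal sketch of such a function f in
Python. It is an assumption, not a method taken from this thread: the name
similarity is hypothetical, and the score comes from the standard library's
difflib.SequenceMatcher, which is just one of several possible measures.
Lowercasing and collapsing whitespace before comparing is also an assumed
choice.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]: 1.0 for identical strings (after
    normalization), 0.0 when no characters can be matched at all."""
    # Assumed normalization: ignore case and collapse runs of whitespace.
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    # ratio() is 2 * M / T, where M is the number of matched characters
    # and T is the total length of both strings.
    return SequenceMatcher(None, a_norm, b_norm).ratio()


if __name__ == "__main__":
    pairs = [
        ("John P. Smith", "John Paul Smith"),
        ("Miller Limited", "Miller Ltd."),
        ("Peter Hammer", "Petre Hammer"),
    ]
    for left, right in pairs:
        print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

A threshold on the score (say, flagging pairs above some cutoff for manual
review) would give the semi-automatic handling; a suitable cutoff depends on
the data and has to be found by experiment.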

I am pretty sure that a lot of people have been thinking
about such a thing already and that there should be more
than one solution for this.

Any pointers?

Harlan Grove · Feb 15, 2005

xirx wrote...

When dealing with real-life data, you often find minor
variations and errors in your data. E.g. I have two lists
(databases) in which names differ slightly.

....

Read the two linked threads in

http://groups-beta.google.com/group/microsoft.public.excel.misc/msg/ce86dfc0974048ac

(or http://makeashorterlink.com/?G12C11C7A ). You're correct that other
people have discussed this before, so you should search the newsgroup
archives before posting questions.
