Fuzzy logic to find duplicate Records

S

sheetal.sash

Hello,

Does anyone know if MS Access 2007 uses fuzzy logic to find duplicate
records? We have this huge table with Staff details, which have tons
of duplicates due to typos in first name, last name, zip code, city
etc. So wanted to delete duplicates usig fuzzy logic, which would
detect that a record with "Jane Doe" and "Jane P. Doe" is the same if
they have the same SSN , DOB, Address , city,zip.

Any ideas?

Thanks!
SS
 
6

'69 Camaro

Hi.
Does anyone know if MS Access 2007 uses fuzzy logic to find duplicate
records?

No version of Access uses fuzzy logic do find duplicate records. Either all
columns in two or more records exactly match, or they don't.
So wanted to delete duplicates usig fuzzy logic, which would
detect that a record with "Jane Doe" and "Jane P. Doe" is the same if
they have the same SSN , DOB, Address , city,zip.

"Jane Doe" and "Jane P. Doe" are not duplicates. However, if they have the
same SSN, DOB, address, city and zip, one could write a query that takes the
distinct rows of just these columns, add the name columns, and then insert
those records into a new table. However, unless you do something fancy,
sometimes the database engine will pick the "Jane Doe" version and sometimes
pick the "Jane P. Doe" version to go with the "distinct" SSN, DOB, address,
city and zip, so if you ran the same query 10 times, you'd get a different
set of records every time.

HTH.
Gunny

See http://www.QBuilt.com for all your database needs.
See http://www.Access.QBuilt.com for Microsoft Access tips and tutorials.
Blog: http://DataDevilDog.BlogSpot.com
http://www.Access.QBuilt.com/html/expert_contributors2.html for contact
info.
 
J

John W. Vinson

Hello,

Does anyone know if MS Access 2007 uses fuzzy logic to find duplicate
records? We have this huge table with Staff details, which have tons
of duplicates due to typos in first name, last name, zip code, city
etc. So wanted to delete duplicates usig fuzzy logic, which would
detect that a record with "Jane Doe" and "Jane P. Doe" is the same if
they have the same SSN , DOB, Address , city,zip.

As Gunny says... no. You'll need to do it yourself.

And of course Address is problematic as well: the text strings "2124
W. Main St." and "2124 W Main" are NOT duplicates.

There are good reasons why "mailing list cleanup" companies charge a
pretty penny for their services. It's simply a difficult task; it can
be made easier using (many!) queries, but alert human minds must be
involved in the process, lest you miss actual duplicates and/or delete
records which aren't actually duplicates.

John W. Vinson [MVP]
 
P

Paul Shapiro

SQL Server 2005's SSIS component includes fuzzy matching and you could use
it on Access data. I found it good for identifying possible duplicates, but
still a person needs to look at it to be sure. Names and addresses can be
almost the same and still be duplicates or not duplicates. The fuzzy
matching algorithms includes an indicator of how good the match seems to be,
which prioritizes the rows to look at.
Paul Shapiro
 
S

sheetal.sash

SQL Server 2005's SSIS component includes fuzzy matching and you could use
it on Access data. I found it good for identifying possible duplicates, but
still a person needs to look at it to be sure. Names and addresses can be
almost the same and still be duplicates or not duplicates. The fuzzy
matching algorithms includes an indicator of how good the match seems to be,
which prioritizes the rows to look at.
Thank you all- for your replies.Paul, I think I will explore more on
the SQL Server 2005 fuzzy logic.

SS
 
T

Tony Toews [MVP]

Does anyone know if MS Access 2007 uses fuzzy logic to find duplicate
records? We have this huge table with Staff details, which have tons
of duplicates due to typos in first name, last name, zip code, city
etc. So wanted to delete duplicates usig fuzzy logic, which would
detect that a record with "Jane Doe" and "Jane P. Doe" is the same if
they have the same SSN , DOB, Address , city,zip.

Are there any child records? If not then you could consider having
forms based on the duplicate records wizard query. That is one for
duplicate SSN. Another for duplicate DOB. Yet another for duplicate
phone number. And these forms display all the other data so a human
can make an intelligent decision.

As far as duplicate names go you can parse out the first chunk before
a space and the last chunk after a space, and, by using an update
query, update a first name field and a last name field. This won't be
perfect of course as 69 Camaro and others point out. Then use the
same form based on duplicate first and last name. You may want to
experiment with looking for dupes based on, say, the first two or
three characters of the first and last names. It's surprising just
how few "jo" "sm" there are in a sample database I have of 10,000
names.

Tony
--
Tony Toews, Microsoft Access MVP
Please respond only in the newsgroups so that others can
read the entire thread of messages.
Microsoft Access Links, Hints, Tips & Accounting Systems at
http://www.granite.ab.ca/accsmstr.htm
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top