Data Accuracy Database Question 1 - Checking For Duplicate Records

  • Thread starter Thread starter Guest
  • Start date Start date
G

Guest

Good afternoon, I hope some can assist me.

I am using Access 2000 and have been asked to write a new database based on
below. Although I can understand VB code, I can not write it (but could use
sample code if it was provided)

A user enters a new contact on the system. If the contact name name is
exactly the same, it will be identified as a duplicate record. However Mr D.
Johnson would be seen as different to Mr D Johnson (No full stop). All of
these records are transported into an Access table and that is where I come
in. I have been asked to analyse the records to find duplicate accounts.
However, I can not think how access could do partial matches e.g. Record 1 is
95% like record 2 so it is probably a duplicate record. The other problem
with this is, I would (I ssume) need to check whether the address is also the
same e.g. Address 1, Address 2 etc which again comes down to how some one
might enter lines. Could some please give me some guidance? in terms of
possible code etc
 
Access doesn't provide fuzzy logic. You're limited to <, >, = and <>
(not equal). The kinds of problem you're being asked to solve require
human intelligence to decide what the problems might be and to then
massage the existing records for comparison. Then you have to make
decisions and take corrective action.

Since detecting and correcting data errors is so expensive it's
important to do all that you can to assure that the data is correct at
the time of entry. You can put lots of data validation in your forms
and controls. A bit of logic can go into the tables, not much though.
You can beat up on your Users (well, educate and encourage would
probably get you further and keep life more pleasant).

HTH
 
Larry is right on,
Here is a very pragmatic solution (but not perfect, and of limited use for
very large datasets)

1. Get all your data into access
2. Make sure you have parsed out all the fields you can (I find Excel
with the Text --> Data function most useful here)
(I'll call the table Table1)
recordID (autoInc, long) (key)
firstname
lastname
address1
address2
city
state
zip
telephone
etc.
what you are after is to get as many 'bins' as you can
before you try to do your matching.
Build another table (TableMatches) (three fields) FirstRecordID (long) and
MatchedRecordID (long) MatchRule (key, combine the two fields)

Here are some rules that have worked for me in similar situations (warning
they are not 100% and you have to eventually look at the data to make sure
all the matched values are valid)

Rule1: if all fields match then create a match
Rule2: if zip and telephone match then create a match
Rule3: if zip and Address1 match then create a match
---avoid
Rule(bad) if Firstnames and LastNames match then create a match (too many
John Smiths out there).

I have needed to get down to something like:
RuleN: If zip and the first six characters of address1 match then call it a
match

You need to make up rules that make sense from the data.

I will give a 'walkthrough' to build rule2, after that your on you own!
Open a new query
add Table1, then add another copy of Table1 (it will show as table11)
Select the zip code in Table1 and connect a relationship to Table11 on zip
Select the Telephone in Table1 and connect a relationship to Table11 on
telephone
add the RecordID from both tables
change to an append query (append records to TableMatches) (make sure you
have the two id fields going into their respective fields in TableMatches)
Right click on the designer page and in the pop-up menu select Unique Values
(you only want one copy of the matched records)
run the query (append the rows)
save the query as "rule2"

Now build a query based on the TableMatches , Table1, Table11 (copy of
Table1)
Link from FirstRecordId to Table1.RecordID and MatchedReacorID to
Table11.RecordID
show the desired fields from the two copies of Table1 e.g. FirstName,
LastName from table1 and FirstName, LastName from Table11
Run the query
Do the matches 'look right'? If so you have at least a start.

Now implement your rules as above and proceed.

Note, reminder, caution
The above matches are 'rough cuts' you will have to review the matched
records for consistency.
You can also delete any 'matches' that are to the same record ( id1 = 1 and
id2 =1) and you can remove any mirrored cases (e.g. id1=1 id2=2 ; Id1=2
id2=1) They are really the same 'match'

Lots of luck

Ed Warren
 
Thank you Ed and Larry for your kind and thoughtful replie. Your quite
right, think I was over complicating the issue by trying to use some
"magical" code for finding partial duplicates. I actually found some code
when searching for "partial duplicates" which I converted in to;

SELECT [tblImported].[ContactName], [tblMaster].[ContactName],
[tblmaster].[CustomerID], [tblmaster].[Title], [tblmaster].[FirstName],
[tblmaster].[LastName], [tblmaster].[JobTitle], [tblmaster].[CompanyName]
FROM tblMaster INNER JOIN tblImported ON [tblMaster].[ContactName] Like "*"
& [tblImported].[ContactName] & "*"
WITH OWNERACCESS OPTION;

Not sure why this works but it could spot for example D JOHNSON as a record
being a ""likely" duplicate of ED JOHNSON. I wonder whether the above
technique would work in conjuction with your logic.
 
The query below says if the contact name in the master table contains the
same string as the contact name in the tableImported then it's a match.
Therefore in the imported table you have
D JOHNSON and that string is matched in the master table with eD JOHNSON.
However, the problem with this is that it will miss the most likely case
where you have
ED JOHNSON AND E. JOHNSON.

Yes any query you can build that makes sense to you will work. My
experience is to try to use any 'coded' fields (zip, telephone, etc.) that I
have first. They tend to be more consistent than things link names and
addresses. Also one has to keep in mind what/why they are a match. For
example if the purpose is to remove duplicate mailing's to the same address
you don't care if you have Sam Johnson and Mary Johnson as entries, in this
case the same address is adequate to call a match. Whereas, if your purpose
is to identify unique individuals for a census or some other purpose you
would not want Sam and Mary to be called a match.

Ed Warren.

WishIKnewMore said:
Thank you Ed and Larry for your kind and thoughtful replie. Your quite
right, think I was over complicating the issue by trying to use some
"magical" code for finding partial duplicates. I actually found some code
when searching for "partial duplicates" which I converted in to;

SELECT [tblImported].[ContactName], [tblMaster].[ContactName],
[tblmaster].[CustomerID], [tblmaster].[Title], [tblmaster].[FirstName],
[tblmaster].[LastName], [tblmaster].[JobTitle], [tblmaster].[CompanyName]
FROM tblMaster INNER JOIN tblImported ON [tblMaster].[ContactName] Like
"*"
& [tblImported].[ContactName] & "*"
WITH OWNERACCESS OPTION;

Not sure why this works but it could spot for example D JOHNSON as a
record
being a ""likely" duplicate of ED JOHNSON. I wonder whether the above
technique would work in conjuction with your logic.


Ed Warren said:
Larry is right on,
Here is a very pragmatic solution (but not perfect, and of limited use
for
very large datasets)

1. Get all your data into access
2. Make sure you have parsed out all the fields you can (I find Excel
with the Text --> Data function most useful here)
(I'll call the table Table1)
recordID (autoInc, long) (key)
firstname
lastname
address1
address2
city
state
zip
telephone
etc.
what you are after is to get as many 'bins' as you can
before you try to do your matching.
Build another table (TableMatches) (three fields) FirstRecordID (long)
and
MatchedRecordID (long) MatchRule (key, combine the two fields)

Here are some rules that have worked for me in similar situations
(warning
they are not 100% and you have to eventually look at the data to make
sure
all the matched values are valid)

Rule1: if all fields match then create a match
Rule2: if zip and telephone match then create a match
Rule3: if zip and Address1 match then create a match
---avoid
Rule(bad) if Firstnames and LastNames match then create a match (too many
John Smiths out there).

I have needed to get down to something like:
RuleN: If zip and the first six characters of address1 match then call
it a
match

You need to make up rules that make sense from the data.

I will give a 'walkthrough' to build rule2, after that your on you own!
Open a new query
add Table1, then add another copy of Table1 (it will show as table11)
Select the zip code in Table1 and connect a relationship to Table11 on
zip
Select the Telephone in Table1 and connect a relationship to Table11 on
telephone
add the RecordID from both tables
change to an append query (append records to TableMatches) (make sure you
have the two id fields going into their respective fields in
TableMatches)
Right click on the designer page and in the pop-up menu select Unique
Values
(you only want one copy of the matched records)
run the query (append the rows)
save the query as "rule2"

Now build a query based on the TableMatches , Table1, Table11 (copy of
Table1)
Link from FirstRecordId to Table1.RecordID and MatchedReacorID to
Table11.RecordID
show the desired fields from the two copies of Table1 e.g. FirstName,
LastName from table1 and FirstName, LastName from Table11
Run the query
Do the matches 'look right'? If so you have at least a start.

Now implement your rules as above and proceed.

Note, reminder, caution
The above matches are 'rough cuts' you will have to review the matched
records for consistency.
You can also delete any 'matches' that are to the same record ( id1 = 1
and
id2 =1) and you can remove any mirrored cases (e.g. id1=1 id2=2 ; Id1=2
id2=1) They are really the same 'match'

Lots of luck

Ed Warren
 
You can use the Soundex function to help you locate names that are spelled
similar. Add a new column to the table named "SoundexCode" (Text, 4
characters). Run an update query to populate this new field. Then run a
grouped query that displays the SoundexCode and counts occurances of each
SoundexCode with count > 1 criteria. Save this query. Create another query
that uses your table of names. Add the saved group query. Join on the
SoundexCode fields. Add fields from the table to the QBE grid. When you run
this query, it should identify those records that are possible Soundex
matches. Sort the query using the name or SoundexCode fields.

To learn more about the Soundex function, check out Access MVP Doug Steele's
April 2005 column "Close" only counts in horseshoes...and databases:
http://www.accessmvp.com/djsteele/SmartAccess.html


The trick is to create a working Soundex function (it's in the Access
Developer's Handbook, if you happen to have a copy), and then call this
function from a query. Start with a SELECT query. When you get it working
properly, convert it into an Update query, so that you can update the values
in the new SoundexCode field in your table.


Good Luck!

Tom

http://www.access.qbuilt.com/html/expert_contributors.html
__________________________________________


:

Good afternoon, I hope some can assist me.

I am using Access 2000 and have been asked to write a new database based on
below. Although I can understand VB code, I can not write it (but could use
sample code if it was provided)

A user enters a new contact on the system. If the contact name name is
exactly the same, it will be identified as a duplicate record. However Mr D.
Johnson would be seen as different to Mr D Johnson (No full stop). All of
these records are transported into an Access table and that is where I come
in. I have been asked to analyse the records to find duplicate accounts.
However, I can not think how access could do partial matches e.g. Record 1 is
95% like record 2 so it is probably a duplicate record. The other problem
with this is, I would (I ssume) need to check whether the address is also the
same e.g. Address 1, Address 2 etc which again comes down to how some one
might enter lines. Could some please give me some guidance? in terms of
possible code etc.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Back
Top