Finding duplicates that might differ slightly

  • Thread starter Thread starter KBuser
  • Start date Start date
K

KBuser

I'm trying to create a query to find duplicate records, but some
entries might contain a . or a '. How can I account for this in my
query?

SELECT BrandName.BrandName, BrandName.Suffix, BrandName.ProductClass,
BrandName.Company, BrandName.updateDate
FROM BrandName
WHERE (((BrandName.BrandName) In (SELECT [BrandName] FROM [BrandName]
As Tmp GROUP BY [BrandName],[Suffix] HAVING Count(*)>1 And [Suffix] =
[BrandName].[Suffix])))
ORDER BY BrandName.BrandName, BrandName.Suffix;


Thanks in advance.
 
Access is quite literal ... "duplicate" means an exact match, character for
character. Unless you can tell Access exactly how to decide what is/isn't a
duplicate, you may have to rely on USB (using someone's brain).

You and I (?!) can look at a list of CompanyNames and easily spot those that
are (probable) duplicates. For example, would you consider the following to
be duplicates?:

IBM
International Business Machines
IBM, Inc.
International Business Machines Corporation

If you or I weren't quite sure, we'd probably look at one/more of the
additional fields to help confirm/reject them as duplicates. Unless
instructed how, Access can't do that.

If you have 10 million records, you probably won't want to try this. If you
have 10,000 records or less, you might want to use a query to sort the
field(s) in alpha sequence, then eyeball the list.

Or, you could try to explicitly define how to evaluate each possible
"acceptable" difference that nonetheless still results in "duplicates".

By the way, if all four of those business names I listed are the same
business, which one is the "correct" name? Why does that matter? Because
any related tables/records will have to also be converted over to use the
record ID of the correct name!

Regards

Jeff Boyce
<Office/Access MVP>
 
I would consider a duplicate to be Seagram's and Seagrams. Or K.R.B and
KRB to be the same. There are about 20,000 in this table, which is
constantly updating, so I'll need queries that can be run in the
future. I don't want the queries themselves to do any editting, just
return the results so that I may decide on the best course of action.
As it is, there are about 260 exact duplicates and 389 brands with an '
somewhere in the name.
I'm not worried about the examples you gave, as those would be too
difficult at this juncture to automate a search for. Also, the
occurence of such a situation is likely very low.
 
The expression:
replace(replace([brandName],".",""),"'","")
will give you a string with all dots and 's removed. You can then look for
duplicates of that.
hope that helps
-john
 
Thanks, where in my original statement should I place that. Also, will
that edit the table that is being queried or just create a temporary
string/array to query against?
 
What you are requesting is known as a "fuzzy match" and is one of the most
difficult problems in data processing - not amenable to SQL queries, which,
as Jeff Boyce states, are quite literal minded. I once worked on this problem
for a client, an a realiable solution required thousands of lines of C code.
 
You could put the expression in a query, but it would make it be super slow.
What I would do (maybe there's a better way) would be to create a new field
in the table, called, say CleanName. Run an update query to have this field
updated to the replace expression I gave you earlier. Then change your find
duplicates query to look at CleanName rather than BrandName. Another thing
you can do is to make sure there are no leading or trailing spaces with
trim() and also maybe just take the first 12 or so characters of the field,
which can be useful if you have stuff like:
Malcom B Smith, Jr.
Malcom B. Smith junior
You can also add more replaces to get rid of internal spaces or commas.
for example:
trim(left(replace(replace(replace([brandname],".",""),"'","")," ",""),12))
should get rid of dots, commas and spaces, and take the first 12 characters,
thus making both Malcom's above come out the same.
(the trim is redundant since we already got rid of spaces, but i left it in
as an example)

hope this helps
-John
 
Slick, John!

Though, given the perversity of human data entry, you'd probably need to
expand that approach to include (exclude) all spaces (some heavy-thumbed
persons put two spaces instead of one).

And I'm still not sure how the original poster is going to handle spelling
and transposition (e.g.):
BrandEx
BrandX
BarndX

Jeff Boyce
<Office/Access MVP>

John Welch said:
The expression:
replace(replace([brandName],".",""),"'","")
will give you a string with all dots and 's removed. You can then look
for duplicates of that.
hope that helps
-john

KBuser said:
I'm trying to create a query to find duplicate records, but some
entries might contain a . or a '. How can I account for this in my
query?

SELECT BrandName.BrandName, BrandName.Suffix, BrandName.ProductClass,
BrandName.Company, BrandName.updateDate
FROM BrandName
WHERE (((BrandName.BrandName) In (SELECT [BrandName] FROM [BrandName]
As Tmp GROUP BY [BrandName],[Suffix] HAVING Count(*)>1 And [Suffix] =
[BrandName].[Suffix])))
ORDER BY BrandName.BrandName, BrandName.Suffix;


Thanks in advance.
 
I was actually able to do what I wanted before I got any replies,
however not the way I wanted to do it. I created a duplicate field
(BrandName2) and in that field did a simple find and replace for all
the characters I wanted to take out. Then I ran my find dupe query on
that field, hid it, and showed the original field. I'm still hoping to
find a way to automate it, as this database will be constantly
updating.

Thanks to all for your help.

KrB.
 
Back
Top