find duplicates problem

tw · Jul 25, 2005

Hi,

My scenario is that i have 2 system with name and adress (100.000 names),
that have to be merged into 1 system without any duplicates.

The problem is that the spelling is not 100% between the system.

One way to find duplicate is to group name,adress and count > 1.

My dream is to use the sound index "Difference" so can i get around the
spelling problem.

DIFFERENCE
Returns the difference between the SOUNDEX values of two character
expressions as an integer.
Syntax
DIFFERENCE ( character_expression , character_expression )

Is that possible to use DIFFERENCE to find duplicates?

And how should the t-sql look like?

Example
name adress city
charles way1 state1
charle waj1 stat1
charlez vay1 stat1

I want to find this example, that this 3 is duplicates.

Should i use ordinary way with group and count >1, this would not be
duplicates.

Help

Thanx

Tw

Michel Walsh · Jul 25, 2005

Hi,

One of the problem is that the difference is not transitive.

Difference between wa and wb can be 1,
difference between wb and wc be 1,
but difference between wa and wc can be 2.

So, someone can say there is two groups, wa and wc, with wb member of either
of these,

or,

can say there is just one group, wb, with wa and wc also member of this
single group.

Another solution is to say there is three groups, wa, wb, and wc, and that 2
candidates are eventually found member of wa; 3, of wb; and 2, of wc.

That last solution seems easier to get. Untested, something like:

SELECT a.name, a.address, a.city, COUNT(*) as PotentialMemberCount
FORM myTable As a INNER JOIN myTable As b
ON Difference( a.name, b.name) < 2 AND
Difference( a.address, b.address) < 2 AND
Difference( a.city, b.city) < 2
GROUP BY a.name, a.address, a.city
HAVING COUNT(*) > 1

.... just "a" small problem, 3 * 100 000 * 100 000 evaluations of Difference
are involved. Probably not quite very good.

It would be probably easier to get an exact match on something restrictive
first. City name, as example, could easily come handy. Even if someone can
type NewYork, New York, New-York, and so on, you can officially claim it
should be New York, and you replace all other syntaxes, with the official
one. A 'no match' query will easily give you the not official typing, and
then, you can UPDATE them with the official typing. That being done, then
the query:

SELECT a.name, a.address, a.city, COUNT(*) as PotentialMemberCount
FORM myTable As a INNER JOIN myTable As b
ON a.city = b.city AND
Difference( a.name, b.name) < 2 AND
Difference( a.address, b.address) < 2
GROUP BY a.name, a.address, a.city
HAVING COUNT(*) > 1

is likely to be much faster, since difference will be computed 2 * SUM( Xi
^2) where Xi = number of people in city "i". As example, assume half the
people live in New York, (which is bad, for our performance, here) and the
other people live in 100 of other cities, that could make this query four
times faster than the previous one. That can still be an unacceptable time.
With MS SQL Server 2005, you can split the table "by city" on different
server, so the query is solved "in parallel", which is also a case where you
could save extra time. As example, if you split the table uniformly on 10
servers, disjoint by city (or groups of cities) so that each server has
about the same number of records, that can take 100 less time than the
initial query... (but still, remember you have mobilized 10 times more
resources). It may be preferable to relay on Postal Code, or something else,
with a finer granularity than City.

Hoping it may help,
Vanderghast, Access MVP

Finding a count of duplicates	4	Jul 29, 2009
Duplicate Query	7	Jan 14, 2010
Duplicates in union query	3	Nov 22, 2012
Excel Subtotal without calculate duplicate lines	4	Feb 22, 2017
Selecting unique record from duplicate	2	Mar 24, 2010
Access Compare records in same access table	0	Mar 20, 2018
using duplicate function	13	May 6, 2013
find duplicates	1	Mar 1, 2006

find duplicates problem

tw

Michel Walsh

Ask a Question

Similar Threads