find duplicates problem

  • Thread starter Thread starter tw
  • Start date Start date
T

tw

Hi,

My scenario is that i have 2 system with name and adress (100.000 names),
that have to be merged into 1 system without any duplicates.

The problem is that the spelling is not 100% between the system.

One way to find duplicate is to group name,adress and count > 1.

My dream is to use the sound index "Difference" so can i get around the
spelling problem.

DIFFERENCE
Returns the difference between the SOUNDEX values of two character
expressions as an integer.
Syntax
DIFFERENCE ( character_expression , character_expression )

Is that possible to use DIFFERENCE to find duplicates?

And how should the t-sql look like?

Example
name adress city
charles way1 state1
charle waj1 stat1
charlez vay1 stat1

I want to find this example, that this 3 is duplicates.

Should i use ordinary way with group and count >1, this would not be
duplicates.

Help

Thanx

Tw
 
Hi,


One of the problem is that the difference is not transitive.

Difference between wa and wb can be 1,
difference between wb and wc be 1,
but difference between wa and wc can be 2.

So, someone can say there is two groups, wa and wc, with wb member of either
of these,

or,

can say there is just one group, wb, with wa and wc also member of this
single group.


Another solution is to say there is three groups, wa, wb, and wc, and that 2
candidates are eventually found member of wa; 3, of wb; and 2, of wc.

That last solution seems easier to get. Untested, something like:


SELECT a.name, a.address, a.city, COUNT(*) as PotentialMemberCount
FORM myTable As a INNER JOIN myTable As b
ON Difference( a.name, b.name) < 2 AND
Difference( a.address, b.address) < 2 AND
Difference( a.city, b.city) < 2
GROUP BY a.name, a.address, a.city
HAVING COUNT(*) > 1


.... just "a" small problem, 3 * 100 000 * 100 000 evaluations of Difference
are involved. Probably not quite very good.


It would be probably easier to get an exact match on something restrictive
first. City name, as example, could easily come handy. Even if someone can
type NewYork, New York, New-York, and so on, you can officially claim it
should be New York, and you replace all other syntaxes, with the official
one. A 'no match' query will easily give you the not official typing, and
then, you can UPDATE them with the official typing. That being done, then
the query:


SELECT a.name, a.address, a.city, COUNT(*) as PotentialMemberCount
FORM myTable As a INNER JOIN myTable As b
ON a.city = b.city AND
Difference( a.name, b.name) < 2 AND
Difference( a.address, b.address) < 2
GROUP BY a.name, a.address, a.city
HAVING COUNT(*) > 1



is likely to be much faster, since difference will be computed 2 * SUM( Xi
^2) where Xi = number of people in city "i". As example, assume half the
people live in New York, (which is bad, for our performance, here) and the
other people live in 100 of other cities, that could make this query four
times faster than the previous one. That can still be an unacceptable time.
With MS SQL Server 2005, you can split the table "by city" on different
server, so the query is solved "in parallel", which is also a case where you
could save extra time. As example, if you split the table uniformly on 10
servers, disjoint by city (or groups of cities) so that each server has
about the same number of records, that can take 100 less time than the
initial query... (but still, remember you have mobilized 10 times more
resources). It may be preferable to relay on Postal Code, or something else,
with a finer granularity than City.



Hoping it may help,
Vanderghast, Access MVP
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Back
Top