How to best approach deleting records with both a duplicate address and a duplicate name in a large file


Mike C

Hello,
I have a medium sized file (approx 30,000 records) that has many names
and addresses that have been entered twice. I need to find an
efficient way to delete all records that have both a duplicate name
and a duplicate address. Records with a duplicate name but not a
duplicate address should stay and vice versa.

Is there a best way to approach this? Should this typically be
handled with a query? Does the Find Duplicates query handle this
problem well?

Thanks for any insight anyone can provide. I am sure this is a common
problem, but I have not had to do it before and my first few attempts
don't seem to work.

(I will also need to attempt to delete Like records as the next step,
but for now I guess I will just figure out how to get the main file's
exact duplicates out of the equation.)
 

Allen Browne

This example assumes:
- a duplicate is defined on the combination of Surname + FirstName + Address + City;
- you can ignore records that have nulls in those fields;
- a primary key field of ClientID;
- the record you want to keep is the one with the lower primary key value.

Use a subquery like this:
DELETE FROM tblClient
WHERE tblClient.ClientID <>
    (SELECT Min(ClientID) AS MinClient
     FROM tblClient AS Dupe
     WHERE (Dupe.Surname = tblClient.Surname)
       AND (Dupe.FirstName = tblClient.FirstName)
       AND (Dupe.Address = tblClient.Address)
       AND (Dupe.City = tblClient.City));

If subqueries are new, here's an introduction:
http://allenbrowne.com/subquery-01.html
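To see how the keep-the-minimum-ID technique behaves, here is a small sketch of the same correlated subquery run against SQLite from Python. The table and column names (tblClient, ClientID, Surname, FirstName, Address, City) mirror the example above; the sample rows are made up for illustration.

```python
import sqlite3

# Build a tiny in-memory copy of the assumed tblClient structure.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tblClient (
    ClientID INTEGER PRIMARY KEY,
    Surname TEXT, FirstName TEXT, Address TEXT, City TEXT)""")
con.executemany("INSERT INTO tblClient VALUES (?,?,?,?,?)", [
    (1, "Smith", "Ann", "1 Elm St",  "Springfield"),  # kept (lowest ID in its group)
    (2, "Smith", "Ann", "1 Elm St",  "Springfield"),  # exact dupe: deleted
    (3, "Smith", "Ann", "9 Oak Ave", "Springfield"),  # same name, different address: kept
    (4, "Jones", "Bob", "1 Elm St",  "Springfield"),  # same address, different name: kept
])

# Same correlated subquery as the Access SQL above: delete every row whose
# ClientID is not the minimum within its (Surname, FirstName, Address, City) group.
con.execute("""
    DELETE FROM tblClient
    WHERE ClientID <> (SELECT MIN(ClientID) FROM tblClient AS Dupe
                       WHERE Dupe.Surname   = tblClient.Surname
                         AND Dupe.FirstName = tblClient.FirstName
                         AND Dupe.Address   = tblClient.Address
                         AND Dupe.City      = tblClient.City)""")
remaining = [r[0] for r in con.execute("SELECT ClientID FROM tblClient ORDER BY ClientID")]
print(remaining)  # [1, 3, 4]
```

Note how only the exact duplicate (row 2) is removed; rows that share just the name or just the address survive, which is exactly what Mike asked for.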
 

gllincoln

Hi Mike,

Build one query at a time (warning: this is air code, there might be a typo or something)

(qs_DupeNames) SELECT * FROM myTable WHERE Name IN(SELECT Name FROM myTable AS tmp GROUP BY Name HAVING Count(*)>1);

(qs_DupeAddress) SELECT * FROM qs_DupeNames WHERE Address IN(SELECT Address FROM qs_DupeNames as tmp GROUP BY Address HAVING Count(*)>1);

Take a look at the output of qs_DupeAddress; if it looks right, you can change it into a DELETE query and that takes care of that.

There is a way to join those two into a single query, but I'm a little shaky on the syntax and, frankly, I think the above is a lot cleaner and easier to follow.

Hope this helps,
Gordon
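Gordon's two-stage filter can be simulated with SQLite from Python; myTable, Name, and Address are the placeholder names from the air code above, and the rows are invented. One thing worth noticing: the second stage counts addresses across the whole name-duplicate subset, not per name, so eyeballing the output before deleting (as Gordon suggests) is good advice.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE myTable (ID INTEGER PRIMARY KEY, Name TEXT, Address TEXT)")
con.executemany("INSERT INTO myTable VALUES (?,?,?)", [
    (1, "Ann Smith", "1 Elm St"),   # dupe name and dupe address -> flagged
    (2, "Ann Smith", "1 Elm St"),   # dupe name and dupe address -> flagged
    (3, "Ann Smith", "9 Oak Ave"),  # dupe name, unique address -> survives stage 2
    (4, "Bob Jones", "1 Elm St"),   # unique name -> filtered out at stage 1
])

# Stage 1 (qs_DupeNames): rows whose Name appears more than once.
dupe_names = """SELECT * FROM myTable
                WHERE Name IN (SELECT Name FROM myTable
                               GROUP BY Name HAVING COUNT(*) > 1)"""

# Stage 2 (qs_DupeAddress): of those rows, keep the ones whose Address also repeats.
ids = [r[0] for r in con.execute(f"""
    SELECT ID FROM ({dupe_names})
    WHERE Address IN (SELECT Address FROM ({dupe_names})
                      GROUP BY Address HAVING COUNT(*) > 1)
    ORDER BY ID""")]
print(ids)  # [1, 2]
```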
 

Sylvain Lafontaine

Sorry, but this code is not exactly right: it will delete all duplicates without leaving any record behind. Usually, when you delete duplicates, you also want to keep at least one record from each set of duplicates.

--
Sylvain Lafontaine, ing.
MVP - Technologies Virtual-PC
E-mail: sylvain aei ca (fill the blanks, no spam please)
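Sylvain's point can be demonstrated with a quick SQLite sketch (myTable, Name, and Address are placeholder names): a delete that targets every member of a duplicate group keeps nothing, which is why the keep-the-minimum-ID condition in Allen's query matters.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE myTable (ID INTEGER PRIMARY KEY, Name TEXT, Address TEXT)")
con.executemany("INSERT INTO myTable VALUES (?,?,?)",
                [(1, "Ann Smith", "1 Elm St"),
                 (2, "Ann Smith", "1 Elm St")])

# Delete everything whose (Name, Address) pair occurs more than once --
# every member of the duplicate group matches, so every member goes.
con.execute("""DELETE FROM myTable
               WHERE (Name, Address) IN (SELECT Name, Address FROM myTable
                                         GROUP BY Name, Address
                                         HAVING COUNT(*) > 1)""")
survivors = con.execute("SELECT COUNT(*) FROM myTable").fetchone()[0]
print(survivors)  # 0 -- both copies are gone, not just the extra one
```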


 

gllincoln

Hi Mike,

In my initial response, I told you how to find the dupes, but I didn't go into much detail about how to select the ones to delete. That's partly because I was tired, and partly because there are so many potential constraints to deal with that it's mind-boggling.

Most important advice: make backups frequently (before every major alteration) and don't overwrite them; build a collection of them, like a janitor's key ring. Label and sort them so you can find what you need, because it's a near certainty that you will mess up at least once and find yourself needing to restore something. Keep in mind that you may not realize when you mess up, not right away; you could be three or four steps down the line before you have your epiphany.

For one thing, if you have related transaction tables (and you probably do), then along with the deletions you have to figure out what to do about the related records. How are you using this data, and how might it be used down the road? More than one person has discarded 'old data' or 'dupe data' only to regret it later: they didn't need the information at the time but then, six months or a year later, uh oh.

There is the dilemma of the phone and email contact info. The two records may reflect alternate contact numbers or emails, and no matter which way you go, if you keep one and dump the other, you will be wrong close to 50% of the time. Murphy's Law ensures that the ones you get wrong will be the ones that matter.

Where you have dupe clients, you may have purchase histories and transactions split across both clients. You can almost guarantee that, no matter which one you delete, 50% of the time customer service will need to access the 'orphaned' transaction (for warranty or support); they will search for transactions under that client and have to tell the client, "I can't find your purchase here; do you have your invoice number?" (Not fun for the service rep or the client.)

If these are sales leads, then the dupe might be the result of the prospect being in contact with the company on two separate occasions, maybe talking to two different sales reps, regarding two different products or services. What is the value of this transaction history? If you have a professional sales force, it can be substantial.

If prospect Jones calls back again, asking (again) about an item or a new item, then the level of interest should be considered significantly greater than a first contact. The sales rep should make every possible attempt to negotiate a close: the client has a sustained interest in doing business with your company, so it's time to find out whatever the hidden objection is and deal with it.

The sales rep could go for an upsell, or explore the feasibility of bundling other products or services together with this one, maybe discounting the bundle to sweeten the deal while maintaining a comfortable transaction payout overall. The point is, these bits of extra insight and knowledge are like money in the bank if you have a skilled sales force. Just mentioning the previous contact (in the context of the presentation) makes the customer feel that you take him seriously, that he is important to you, and that you are a professional; he perceives that you have your 'stuff' together and know what you are doing. This is a rather huge edge for the sales team.

You have some value judgements to make and research to do before you hit the switch on this one. I'd make a list of every table in the application (and in any application that links to these tables) that uses the primary key of this table as a foreign key.

When you are done with this, you still won't be done. You will have some dupes where the address is almost identical but not quite: Ste vs. PMB vs. #; one address has 'Apt', the next one says '#', and a third skips the apartment number altogether. (That is, unless your entries are being address-corrected in real time, and if they were, you would probably have been catching dupes on entry, right?) On the names side, you may have a middle initial on one entry, none on the second, and the whole middle name on the third. Checking for dupe emails and dupe phones is the best way to nail 98% of the rest of the dupes.
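The 'almost identical' problem Gordon describes is usually attacked by normalizing both strings before comparing. Here is a tiny Python sketch; the abbreviation table is an assumption for illustration, not a standard, and a real cleanup would need a much larger mapping (or a proper address-correction service).

```python
import re

# Assumed abbreviation map: collapse common address variants to one form.
ABBREV = {"suite": "ste", "apartment": "apt", "#": "apt",
          "street": "st", "avenue": "ave"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse known abbreviations."""
    text = re.sub(r"[.,]", " ", text.lower())
    return " ".join(ABBREV.get(word, word) for word in text.split())

a = normalize("1 Elm Street, Suite 4")
b = normalize("1 Elm St. Ste 4")
print(a == b, a)  # True "1 elm st ste 4"
```

Two spellings that a plain equality test would treat as different records now compare equal, so a dupe check on the normalized columns catches them.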

Hope this helps,
Gordon
 
 
