Cleaning up a GIANT mailing list, with many duplicate names

G

Guest

Hi All,

I'm a new employee, and my new boss has handed me an enormous mailing list
in Excel. I'm hoping I can clean it up using Access, but I'm not super
familiar with it.

This mailing list is the compilation of numerous others, so it has duplicate
entries for many people. I need to eliminate the duplications of names, but
retain the unique info present in the other cells in the same row as the
duplicate entry, like what mailing list that duplicate name originated from,
the date that person signed up for that list, etc.

I thought about doing it by hand, but there must a way to automate this -
there are over ten-thousand rows in this spreadshet. I need help!

Thanks - P
 
J

Jeff Boyce

In looking for an "automatic" way to do this, consider the following:

John Doe
12345 Elm St

J. Doe
12345 Elm Street

John J. Doe
12345 Elm St SW

J.J. Doe
12354 Elm

Access will treat a comparison of these as all being different. Will you?

You may find it easier to sort all the rows by address, then by person.
Once you have this list, I'd suggest USB - using someone's brain - to
determine "duplicates". If you add a column to your table to check Yes/No
for "duplicate", a subsequent query will allow you to select those not
checked.

Good luck!

Jeff Boyce
<Access MVP>
 
E

Ed Warren

1. In your master table you will need an Unique Key value for each row, this
is easy just use an autoicremented key.
2. Next you will need another table with two columns (RecordID1, RecordID2).
In this table you will store the 'duplicates'. Example Record Id 10
duplicates the name in Record ID 101 etc.


In addition to Jeff's suggestions, I've found that looking for duplicates
using the last seven characters of the telephone number a good go.

For example ask the query. "Give me a count of the records having the last
seven characters (or whatever is appropriate for your circumstance) of the
telephone number the same and report any with a count greater than 1".

The next cut I make is to look for duplicates with zipcode and street number
eg zip 90009, first part of the address (In Jeff's example "12345". To do
this right you have to know how to 'parse' (find the 'number part' of the
address) (123 elm --> 123, whereas 1234 elm -> 1234, etc). Then count the
duplicates there.

My experience is that these two cuts will do about 90% of the work of
identifying duplicate addresses.

Now the real problem comes in when you find that:
John Doe = J. Doe below but the 'remaining fields' are different you
have to then resolve which set to keep for the "Master table"

Lots of Luck

Ed Warren.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top