Finding near duplicates by comparing two columns


Joined
Jul 1, 2012
Messages
1
Reaction score
0
Hope this isn't a duplicate (no pun intended) question. I did try to find the answer before posting. I did find similar questions but not excactly my dilemma.

My husband owns a DJ/Karaoke business. I'm quite technical so take care of quite a bit of the computer stuff. We have a constantly updating database of Karaoke music (Artist Name and Song Title) that I aim to have be duplication free. It's a challenge to be sure as the entries come in with spelling errors and the colums swapped to name just two.

I try to manually catch spelling errors in both the artist names and the song titles. But with 40,000 entries it's daunting. I use Excel mostly but can port it over to Access and have. Trying to find the best ways to help myself.

Basically here is my configuration: The artist name is in one column, not two. For example Jackson, Michael is in the same column as Jackson Five. I don't have a last name column and a first name column. Song title is one column.

My problems, to name a few, are as follows:

I might have the artist name come to me as Jackson Five or Jackson 5.

I might have Jackson, Michael come over in the Song column and the song title come in under the Artist Name.

I might have a Jackson, Michael song be titled slightly different (e.g. She's Out Of My Life or She Is Out Of My Life or (She Is) Out Of My Life vs Out of My Life (She Is).

I try to catch all of these anomolies with sorting this way and that but it's really hard. In order to use the built in Duplicate Remover I'd need to have a lot of this cleaned up and that sort of doesn't help me much.

What I'd like is a way to have a query that would go though the data and either return a list of items that get flagged into a new spreadsheet or somehow highlight those rows that meet the "DUP" criteria in the existing spreadsheet. I don't want it to do any deleting for me.

I'd like it to tell me that it found Jackson, Michael in both columns and show me where. I'd like to know if it also found Jackson, Micheal (spelled differently). I'd like to know that it found what it thinks are the same song with an xx% difference for the same artist.

I don't mind making several passes and that might make sense given the size. For example, go through the artist names and find slight differences in spelling. Then do the same thing for the song title. There are ligitimate song titles that are duplicates just done by different artists, I can manually "approve" those.

I know this has been long but I wanted to give lots of context. I'm also open to using Access if the tools are there over what Excel has to offer. I have both Excel 2003 and 2010.

I realize nothing is going to be 100% duplication free but it would be nice to get close. After I feel it's as good as it's going to get I'll create a way to make sure things are added without being a duplicate.

Thanks for listening :)

Gary
 
Ad

Advertisements

Joined
Mar 20, 2012
Messages
764
Reaction score
4
This sounds like a bit of a task indeed. If you have a specific list of artists you want to look for, you can build a separate formula for each, and then filter out for results to find the real artists and replace them as needed. You could use something like the following to find any conglomerate of Michael Jackson and display either "Jackson, Michael" or "Not Him". Let me know if this helps you out. It's not a total fix, but I think it will start you in the right direction.

=IF(OR(ISNUMBER(MATCH("M*l J*n",A1,0)),ISNUMBER(MATCH("J*n, M*l",A1,0))),"Jackson, Michael","Not Him")
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top