Merging duplicate entries in one table

G

Guest

Hello,
I work in a theatre and we use access for our mailing list and a FoxPro
program for our ticketing software. The FoxPro program is not versatile when
it comes to database maintenance (Choice ticketing). It is limited to
deleting and changing customers one by one. Unfortunately, someone added
several mailing lists to our ticketing software without deleting duplicates
in access or excel first. I have 27,000 entries in that program which I
would guess about 10,000 are duplicates.

My goal is to have one mailing list in the office which has any additional
data extracted from Choice but without duplicates.

So far I have made some progress. My [Main] list and my [Choice] list are
individually de-duped and uniform. By uniform I mean they all have
consistent formatting (i.e. ave. is avenue and st. is street or saint, all
mr. or mrs. are put in a prefix field, etc.) I deduped everything manually
using different duplicate queries like first, last/last, address/first,
address/and phone number.

I then merged main with choice, by making sure all the columns were uniform
and appending data from one to the other. I am using checkboxes for fields
that determine what type of shows someone is interested in, what shows they
attended, if they have donated money, etc. I have an extremely daunting
task. I have several thousand duplicates - but all the checkboxes do not
match up.

I want to merge my entries like so:

When [Last] & [Address] is the same
Choose [Phone] with number over [Phone] with no number
And choose yes over no for [Film] [Cabaret] [Blues] or [Theatre]
If [First] has different values then keep [First] with phone number

Is that possible? I'm not experienced in sql or the union function but can
follow instructions.

Thanks in advance for any help
Charles
 
G

Guest

Charles:

Firstly treat the following as guidance only as its difficult to be
categorical without seeing exactly how the data is currently represented.

Start by selecting all distinct Last and Address values from your combined
Main and Choice tables, which I've called YourTable below:

SELECT DISTINCT Last, Address
FROM YourTable ;

Insert these rows into a new table which you've already designed with
columns CustomerID (autonumber), First, Last, Address and Phone. I've called
this table NewTable below.

Select all rows from YourTable where the phone column has a number:

SELECT Last, Address, First, Phone
FROM YourTable
WHERE Phone IS NOT NULL;

Join this to NewTable on the Last and Address columns and UPDATE the Fisrt
and Phone columns in NewTable with the values of the First and Phone columns
from YourTable.

Select all rows from YourTable with empty (NULL) Phone columns where a row
does not yet exist in NewTable with a First value for the current Last and
Address values:

SELECT Last, Address, First
FROM YourTable
WHERE Phone IS NULL
AND NOT EXISTS
(SELECT *
FROM NewTable
WHERE NewTable.Last= YourTable.Last
AND NewTable.Address = Yourtable.Address
AND NewTable.First IS NULL);

Join this to NewTable on the Last and Address columns and UPDATE the First
column in NewTable with the value of the First column from YourTable.

That should create a list of all customers. As regards the type of shows
someone is interested in, what shows they attended, if they have donated
money, etc, having separate Boolean (Yes/No) columns for each of these is
completely the wrong design in a relational databse. You should have
separate tables of Shows etc. You then need to model the many-to-many
relationships between them and NewTable (lets now call it Customers) with
other tables which each contain foreign key columns referencing the primary
keys of the other two, so a ShowsAttended table would have columns CustomerID
and ShowID. Its difficult to give precise guidance how to populate these
tables without knowing exactly how this data is currently represented, but
essentially you will need to execute a series of 'append' queries which join
the new Customers table to the original YourTable table on the Last and
Address columns and inserts rows into tables such as ShowsAttended.

Ken Sheridan
Stafford, England

CharlesZ said:
Hello,
I work in a theatre and we use access for our mailing list and a FoxPro
program for our ticketing software. The FoxPro program is not versatile when
it comes to database maintenance (Choice ticketing). It is limited to
deleting and changing customers one by one. Unfortunately, someone added
several mailing lists to our ticketing software without deleting duplicates
in access or excel first. I have 27,000 entries in that program which I
would guess about 10,000 are duplicates.

My goal is to have one mailing list in the office which has any additional
data extracted from Choice but without duplicates.

So far I have made some progress. My [Main] list and my [Choice] list are
individually de-duped and uniform. By uniform I mean they all have
consistent formatting (i.e. ave. is avenue and st. is street or saint, all
mr. or mrs. are put in a prefix field, etc.) I deduped everything manually
using different duplicate queries like first, last/last, address/first,
address/and phone number.

I then merged main with choice, by making sure all the columns were uniform
and appending data from one to the other. I am using checkboxes for fields
that determine what type of shows someone is interested in, what shows they
attended, if they have donated money, etc. I have an extremely daunting
task. I have several thousand duplicates - but all the checkboxes do not
match up.

I want to merge my entries like so:

When [Last] & [Address] is the same
Choose [Phone] with number over [Phone] with no number
And choose yes over no for [Film] [Cabaret] [Blues] or [Theatre]
If [First] has different values then keep [First] with phone number

Is that possible? I'm not experienced in sql or the union function but can
follow instructions.

Thanks in advance for any help
Charles
 
G

Guest

Ken,
I am not sure if I really understood your answer. I am truly a beginner
with Access. Let me clarify my situation a little.

I am not the Marketing Director at my theatre but I think he is pretty
satisfied with the layout of the master list. We have about 100 shows every
year which all fall under about 7 different categories. It would be annoying
to have to do a query for 30 different shows over a period of 4 years if we
were just looking for all of our customers who were interested in bluegrass.
Currently our mail goes to customers who have expressed interest in a certain
type of show. I would like to add to that list people who have bought
tickets to a certain type of show as well as new columns for the best
attended shows. That information is contained in the "Choice List". The
mailing list that we use is the "Mainlist".

I guess I was hoping for a step by step procedure for implementing a sql
code that would merge the duplicates as I said originally:
When [Last] & [Address] is the same
Choose [Phone] with number over [Phone] with no number
And choose yes over no for [Film] [Cabaret] [Blues] or [Theatre]
If [First] has different values then keep [First] with phone number

I recognize that this may be difficult, so I got to thinking about another
way that I could do this. First I would create a find duplicates query which
would show one yes/no column (FILM for instance). Then I would append code
to the original query to find duplicates which have different Yes/No values
for that column. Since one would be yes and one would be no and I would
prefer a yes value over a no value I would simply copy and paste all the yes
values on to the no values and then do a similar query for the next column.
That way all of my data would be uniform across the board and I could start
deleting duplicate values without worrying about lost data.

Hopefully you could help me out with that?

Thank you,

Charles
 
G

Guest

Charles:

It might help if we go back to first principles and quickly outline how
relational databases model the part of the real world they are concerned with
and how this applies in your case.

The database relational model was originally developed around 1970 by Codd
and has become the dominant model for business databases.. Over the years the
model has been developed by a vast amount of published research, and the SQL
language has similarly undergone ongoing development. The important thing,
however, is that the model is based on firm scientific principles, so when
the 'rules' of the model are respected applications can be relied upon to
work correctly.

A well designed relational database identifies entity types in the real
world and represents each of these as tables. The attributes of an entity
type are represented by columns in the tables. Its important that there is
no redundancy (or at least as little as is possible without losing
information) in the tables and this is ensured by means of an analytical
processknown as normalization. Tables are said to be in 'normal forms' which
range from First Normal Form (1NF) to Fifth Normal Form (5NF). Other normal
forms beyond this have been identified, but these are rather bizarre in
nature and we don't need to concern ourselves with them here. In fact for
the momemnt we need only concern ourselves with the first three.

Normalization is defined in terms of the functional dependencies within a
table (and in the case of 5NF across tables). A column in a table is sid to
be functionally dependent on another column (or columns) if the value in the
column is determined by the value in the other column. If I had CustomerID
42 in your database then that determines my FirstName value as 'Ken' and
nothing else and my LastName value as 'Sheridan' and nothing else because the
FirstName and LastName columns in a Cutomers table are functionally dependent
on the CustomerID column. To be in Third Normal Form all non-key field in
the table must be functionally dependent on the whole of the primary key and
nothing but the primary key. So a Customers table with these three columns
is fine as the two non-key columns FirstName and LastName are dependent on
the primary key CustomerID and nothing else.

Say I go to some shows and you have a ShowsAttended table with columns
ShowName, CustomerID, FirstName and LastName. This table is not correctly
normalized because FirstName and LastName are not functionally dependent on
the whole of the table's primary key (The combination of Showname and
CutsomerID in this case), but on part of it (CustomerID). It would be
perfectly possible to have CustomerID 42 and completely different FirstName
and LastName values to mine in the same row.

If we take out the FirstName and LastName columns from this shows table then
the CustomerID maps back to my row in Customers so my names are always
correct. But the table is still not good enough. Its theoretically possible
there might be two different shows with the same name (this certainly happens
in the cinema), so we don't know which one I actually attended. Also it
would be very easy for the Showname to be entered inconsistently in different
rows (I once found three versions of myself amongst the authors of technical
articles in a database I worked on!). What we need is a Shows table with
ShowID (a unique number) and ShowName columns. In ShowsAttended we the have
CustomerID and ShowID columns, plus columns such as DateAttended. This
ShowsAttended table is actually modelling a special kind of entity type
because it is a relationship type between Customers and Shows modelling the
many-to-many relationship between those entity types.

When it comes to types of shows, this is another entity type so we need a
Categories table with columns CategoryID, Category. A show might be
categorised as more than one type, however, Jazz and Blues say, so we need a
table, ShowCategories, to model the many-to-many relationship between Shows
and Categories, with columns ShowID and CategoryID.

With his set of tables we are able to extract whatever information we need
from the data because it is a solid model of the part of the real world we
are interested in. To find out which customers attended shows categorised as
Blues or Country say we'd join the Customers, ShowsAttended, Shows,
ShowCategories and Categories in a query and restrict that query's result set
to rows where the Category was Blues or Country.

Customer interests regardless of whether they've attended any shows of those
categories is another entity type, so is modelled by another table
CustomerInterests with columns CustomerID and CategoryID. So if Jazz is
category 6 and Blues is category 9 and I' have interests in each there would
be two rows in this table for me with values 42;6 and 42:9. This table can
then also be incorporated in the query so it returns not only customers
who’ve attended shows of particular categories, but also who've expressed an
interest in them.

I hope this has given you an idea of how the sort of real world scenario you
are dealing with can be robustly modelled in a relational database. The way
of modelling it by means of a table with many different Boolean
(Yes/No)columns to represent customer interests is not only fundamentally
wrong in a relational database, but would make it very difficult to extract
useful information. The beauty of a solidly designed logical model is that,
because it is a scientifically sound representation of the underlying
realities, it enables you to extract whatever information those realities can
provide. The tools for doing this, particularly the SQL language, provide a
very powerful and flexible armoury for doing so.

If I were in your shoes and faced with the task of producing a solid
database which will not only fulfil your immediate requirements but provide a
firm basis for adding other functionality, the first thing I would do would
be to design the logical model in terms of the entity types and relationships
between them. I'd do this, not on my machine, but with a good supply of
pencils and paper, drawing out the model diagrammatically, and testing it by
asking whether it can not only give me the right answers, but also whether
there is anything in the design which might lead it to give me the wrong
answers, which is what results from a badly normalized model. Only then
would I build the 'back end' database as a set of empty tables and the
relationships between them along with the 'front end' interface for entering,
editing and querying the database. Finally I would transfer the data from its
present sources using various 'append' and 'update' queries. The exact
nature of these would depend on the final design of the logical model and the
nature of the existing source data, but the SQL statements I posted in my
earlier post show the general principles involved in the first stage,
producing the Customers table itself. Inserting the Shows and Categories
data should not be difficult. The trickier work is in populating the other
tables which represent the relationships between these main entity types.
The process tends to be tedious rather than difficult, however.

Ken Sheridan
Stafford, England

CharlesZ said:
Ken,
I am not sure if I really understood your answer. I am truly a beginner
with Access. Let me clarify my situation a little.

I am not the Marketing Director at my theatre but I think he is pretty
satisfied with the layout of the master list. We have about 100 shows every
year which all fall under about 7 different categories. It would be annoying
to have to do a query for 30 different shows over a period of 4 years if we
were just looking for all of our customers who were interested in bluegrass.
Currently our mail goes to customers who have expressed interest in a certain
type of show. I would like to add to that list people who have bought
tickets to a certain type of show as well as new columns for the best
attended shows. That information is contained in the "Choice List". The
mailing list that we use is the "Mainlist".

I guess I was hoping for a step by step procedure for implementing a sql
code that would merge the duplicates as I said originally:
When [Last] & [Address] is the same
Choose [Phone] with number over [Phone] with no number
And choose yes over no for [Film] [Cabaret] [Blues] or [Theatre]
If [First] has different values then keep [First] with phone number

I recognize that this may be difficult, so I got to thinking about another
way that I could do this. First I would create a find duplicates query which
would show one yes/no column (FILM for instance). Then I would append code
to the original query to find duplicates which have different Yes/No values
for that column. Since one would be yes and one would be no and I would
prefer a yes value over a no value I would simply copy and paste all the yes
values on to the no values and then do a similar query for the next column.
That way all of my data would be uniform across the board and I could start
deleting duplicate values without worrying about lost data.

Hopefully you could help me out with that?

Thank you,

Charles
 
G

Guest

Ken,
As helpful and interesting as that was it didn't really answer my question
which was: Can you run a find duplicates query which only returns results in
which the duplicates have different values in one specific yes/no field?

At my theatre we run Choice ticketing software which does have a relational
database in it. I am able to look up any customer and find out what shows
they attended and when you enter a new event in the system you do in fact
have to come up with a unique id for that code and you can select a price
code or create a new one to associate it with. You can also say what
interest codes you want associated with that event. When a customer buys a
ticket it will automatically put down an interest code or two for them. You
can also manually type in interest codes for a specific customer if they call
you and say they suddenly have an interest in dance, where they never did
before. You can run queries that rely on all this data. It is a very
complete program - but you have to change, merge, and delete customers one by
one. My problem is that I am dealing with 10,000 duplicates and some missing
information where non-customers called the administrative office to be put on
the mailing list. My other problem is that I need to come up with a
comprehensive mailing list in just a couple months. My plan is this:

I want a mailing list by interest code (forget shows and accounting.. our
ticketing software will take care of that) I simply want to fill the 275
seats per show by sending out advertisements to those with interest in
certain types of shows.

I am going to delete all non-ticket buyers from our ticketing software,
eventually, and eventually I am going to put back names of non-ticket buyers
once i have taken out all the duplicates in access. Once I have reached that
final goal - we can safely use the ticketing software for a more complete
relational database.

My first task at hand is to clean up the mailing list before I begin to work
on a relational database in our ticketing software or in access.

Thanks,
Charles
 
G

Guest

Charles:

I think I have answered your question to a greater degree than you realize,
but I'll try and spell it out in more detail below.

The way to eliminate duplicates is not to get rid of the redundant rows from
a badly designed table, which just gives you a badly designed table without
duplicates, but to insert the data into a properly designed set of tables.
That way the system will do the work for you. For customers' interests you
need a simple schema of the following tables: Customers, CustomerInterests,
Categories.

The Customers table should have columns for the attributes of Customers,
i.e. Name and Address and phone. If the name and address are stored as just
two columns in your existing tables rather than as FirstName, LastName, and
several 'Address Line' columns you should create a unique index on the two
columns. However they are stored at present, though, you should initially
have the same columns in your new Customers table and index on the set of
name and address columns uniquely. Also give the table a CustomerID
autonumber column as its primary key. All you then need to do is insert rows
into the Customers table with an 'append' query based on your existing table
(or several append queries if the existing data is still in more than one
table. The indexing in the new Customers table will allow only one row per
customer to be inserted no matter how many times they might be duplicated in
the existing table(s).

To ensure that the existing records with phone numbers are inserted insert
those first by testing for Phone IS NOT NULL, e.g.

INSERT INTO Customers (Name, Address, Phone)
SELECT Name, Address, Phone
FROM ExistingTable
WHERE Phone IS NOT NULL;

Repeat this for whatever 'ExistingTables' you have. Then insert rows
without phone numbers:

INSERT INTO Customers (Name, Address, Phone)
SELECT Name, Address, Phone
FROM ExistingTable
WHERE Phone IS NULL;

Because of the unique indexing on Name and Address only one row will be
inserted by the first query for each name and address with a phone number.
Similarly the second query will insert only one row for each name and address
without a phone number where that name and address hasn't already been
inserted by the first query.

I imagine you can create the Categories table easily enough by hand as the
number of categories is presumably going to be relatively limited. This
should have a CategoryID autonumber primary key and a Category text column
with values like Blues, Jazz, Rock etc.

The CustomerInterests table will have columns CustomerID and CategoryID,
both of long integer number data type (not autonumber). Make both columns
the composite primary key of the table. To insert rows into this table you
need one 'append' query for each Category (which is why I said the process
was tedious but not difficult). So for Blues say the query would be:

INSERT INTO CustomerInterests(CustomerID, CategoryID)
SELECT CustomerID, CategoryID
FROM ExistingTable, Customers, Categories
WHERE Customers.CustomerName = ExistingTable.CustomerName
AND Customers.Address = ExistingTable.Address
AND Blues = TRUE
AND Category = "Blues"

In this query Blues is the name of the Boolean (Yes/No) column in
ExistingTable and is the value in the Category column of one row in the
categories table. Repeat this for each ExistingTable if you have more than
one. Only one row for each CustomerID/CategoryID combination will be
inserted by virtue of the unique indexing of the two columns in
CustomerInterests which is implicit in their being defined as the primary
key.

Run variations on this query for each category, simply changing the two
references to Blues to the category in question, in one case as the name of
the Boolean column in the ExistingTable and in the other as the value in the
Categories table.

Once you grasp the principles involved in this sort of thing its very easy
to implement it for a more complex schema involving a larger number of
tables. Fundamentally all it involves is first filling the tables which
represent the main entity types, Customers and Categories here, with distinct
values, then by joining them to the original table inserting rows to
represent the relationships between the main entity types.

In case you are wondering how the tables are joined in the last query
without it having any JOIN clause the answer is that the joins are done by
'join criteria' in the WHERE clause. This was the only way of doing it in
SQL until the JOIN operator was introduced in the SQL-92 standard. It was
introduced then to cater for OUTER JOINs, which can't be done in the WHERE
clause. You'll notice that there is no join criterion in relation to the
categories table. This is because only one row is being returned from that
table. Without any join criteria every row in each table is joined to every
row in the other(s), what is known as the Cartesian product of the tables.
As only one row is returned from categories here it returns rows in which the
CategoryID value for 'Blues' is returned along with each of CustomerID values
from Customers where the name and address match the name and address in those
rows of ExistingTable where the Boolean Blues column is True.

Ken Sheridan
Stafford, England
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top