Deduped Field not working?

C

connorslack

hey All,

I have a table with 110,453 people/records. When I run a dedupe SQL
statement on the "emailID" field, i get 90,167 unique records, which
is perfect, and it also puts beside each unique email a "count" of how
many times that particular record appears in the database.

My problem arises when i add a second field, lets say "newsletter"
this filed denotes if the user has opted in to receive a newsletter
("yes" or "no"). My problem is that when i add this field in, the
query no longer becomes deduped because for those users who entered
more than once have put Yes for one entry, and No for another, there
fore rendering my dedupe query useless for that particular record and
so forth.

I have tried to put in the "total" column in the query editor "first"
to pick the very first record of each email, but what happens if the
very first record for newsletter for a particular email is "no" but
the second record for the record(assuming this is a deduped record we
have come to) is Yes? if i run a query for all Unique emails with
newsletter = yes, my query wont be right.

Thanks all.
 
J

John W. Vinson

hey All,

I have a table with 110,453 people/records. When I run a dedupe SQL
statement on the "emailID" field, i get 90,167 unique records, which
is perfect, and it also puts beside each unique email a "count" of how
many times that particular record appears in the database.

My problem arises when i add a second field, lets say "newsletter"
this filed denotes if the user has opted in to receive a newsletter
("yes" or "no"). My problem is that when i add this field in, the
query no longer becomes deduped because for those users who entered
more than once have put Yes for one entry, and No for another, there
fore rendering my dedupe query useless for that particular record and
so forth.

I have tried to put in the "total" column in the query editor "first"
to pick the very first record of each email, but what happens if the
very first record for newsletter for a particular email is "no" but
the second record for the record(assuming this is a deduped record we
have come to) is Yes? if i run a query for all Unique emails with
newsletter = yes, my query wont be right.

Thanks all.

If you want to find all email addresses for which there is a Yes in any one of
the repeats, use the Where "totals" operator for the newsletter field, and put
a criterion of Yes on it. This will give you only the Yes responses.

If you want to see all the email addresses, regardless of the values in the
yes/no field, use Min on those fields - since Yes is stored as -1, No as 0
you'll see Yes if there is any Yes answer for that address, No if they
consistantly said No.
 
K

Ken Sheridan

The problem is that your table is not fully normalized. Normalization, which
eliminates redundancy in a table, is based on 'functional dependencies'. A
column is functionally dependent on another column (or columns) where for
every occurrence of a value of the latter column in the database the value of
the former column is the same, e.g. in a table People with columns PersonID
(the key), FirstName and LastName if I have a PersonID of 42 then wherever
the PersonID value of 42 crops up in a database the first and last names can
only be Ken and Sheridan. FirstName, LastName are functionally dependent on
PersonID and the Person table is correctly normalized.

A table is normalized to Third Normal Form, 3NF (there are others beyond
this, but we don't need to consider them just now) if every non-key column is
functionally dependent solely on the whole of the table's primary key. So if
another table included columns PersonID (not the key in this case),
FirstName, LastName and PhoneNumber it would need a number of rows for me to
handle my various phone numbers. In this table FirstName, LastName are still
functionally dependent on PersonID, but as this is not the whole of the key
(it could be part of a multi-column key) the table is not normalized to 3NF.

In your case the newsletter column is similarly functionally dependent on
whatever your equivalent of PersonID is, but as there are multiple rows per
person to cope with one person having more than one emailID whatever your
equivalent of PersonID is cannot be the table's primary key.

If you already have a 'People' table, i.e. a table with one row per person
to which your table with multiple rows per person is related, then you can
simply put the newsletter column in that table, where it will be correctly
functionally dependent on the key. If you don't have such a table, then you
should create one and include columns which represent the attributes of a
person in it, FirstName, LastName, DateOfbirth and the like, giving it a
numeric PersonID primary key such as an autonumber. Its easy to fill such a
table with an 'append' query which appends SELECT DISTINCT FirstName,
LastName etc from your current table. Then add a foreign key PersonID (Not
an autonumber this time) to your current table and update this to the
PersonID from the People table by joining the two tables on whatever
combination of columns can be used to identify each unique person. You can
then delete the redundant columns from your current table.

Quite which values you'd put in the newsletter column in the People table
where they have inconsistent values for the same person in your current table
is for you to decide. My natural inclination would be to opt for 'yes' where
there is at least one 'yes' per person in the current table, and 'no' where
there is no 'yes' for a person. Again this is easily done with an update
query.

As far as the duplicated emailID values in your current table are concerned
you can prevent this by creating a unique index on the PersonID and emailIID
columns (indexing the columns in combination, not individually).

Ken Sheridan
Stafford, England
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top