Identifying Duplicate Records

P

Peter Hallett

I have a dynamically created and amended table which occasionally contains
duplicate records. These duplicates must be allowed but need to be
flagged-up, when they occur, and marked in the table. Setting the No
Duplicates property is clearly inappropriate.

I can think of one or two ways of going about the job, including, perhaps
running two nearly identical queries, one with the unique records property
set and the other without, then comparing the results but the details of
subsequently identifying and marking the appropriate entries have started to
suggest some rather prolix VBA and I have a suspicion that I have missed
something obvious or, at least, simpler. Has anyone got any thoughts?
 
J

Jeff Boyce

Peter

You have stated that the "duplicates must be allowed".

What business need is being satisfied by having, say, two records that are
duplicates? That is, how will you (or Access, or your users) know which one
is the appropriate one to use in relation to other tables?

If there are some "distinguishing marks" that make the two records actually
not fully duplicated, couldn't you use those "marks" as part of your unique
index?

--
Regards

Jeff Boyce
www.InformationFutures.net

Microsoft Office/Access MVP


Microsoft IT Academy Program Mentor
http://microsoftitacademy.com/
 
D

Dirk Goldgar

Peter Hallett said:
I have a dynamically created and amended table which occasionally contains
duplicate records. These duplicates must be allowed but need to be
flagged-up, when they occur, and marked in the table. Setting the No
Duplicates property is clearly inappropriate.

I can think of one or two ways of going about the job, including, perhaps
running two nearly identical queries, one with the unique records property
set and the other without, then comparing the results but the details of
subsequently identifying and marking the appropriate entries have started
to
suggest some rather prolix VBA and I have a suspicion that I have missed
something obvious or, at least, simpler. Has anyone got any thoughts?


Here's an example of a query that updates a duplicate-flag field in a table
when records exists with a duplicate name:

UPDATE Addrbook SET HasDuplicate = True
WHERE Addrbook.Name In
(SELECT Addrbook.Name FROM Addrbook
GROUP BY Addrbook.Name
HAVING Count(*)>1);

Don't blame me for the bad field name "Name" -- the table was imported.
 
P

Peter Hallett

I left out some of the ‘unnecessary’ detail so as not to confuse the issue
but, since you ask, the situation is as follows:-

Work is commissioned by a major company, listing addresses at which that
work is to be carried out. The orders are indirectly input, on a
semi-automatic basis, into an Access database. The volume of orders is too
large to consider manual data entry and makes visual pre-assessment wholly
impractical.

Payment is made according to the number of addresses listed and NOT the
number of premises visited. Occasionally, the commissioning company includes
duplicates, sometimes in error but two or more jobs may have to be carried
out at the same premises. It is not appropriate, therefore, simply to reject
the duplicates unconditionally . Apart from anything else, this changes the
number of jobs commissioned and hence both the rate of payment and the total
payment due. It is up to the receiving agent to deal with the duplicates
appropriately, in some cases by removing them but, more often, by
ascertaining the nature of the situation, usually by telephoning the
commissioning company. In any event, the number of jobs invoiced will not
generally match the number ordered which is not acceptable to the
commissioning company without its prior consent. The situation is therefore
far from simple. The duplicates must therefore remain and be shown. Sorting
out the repercussions is a job for human beings rather than computers.

As to whether the duplicates are identical the answer is that, apart from
their order in the list, they generally are. However, this is not a
fundamental issue. If duplicates exist, they have to be shown in order to be
subsequently actioned. I hope that this clarifies the situation. I trust
you will agree that the problem cannot be resolved before it has to be dealt
with.
 
G

gllincoln

Hi Peter,

The first order of business would be to determine what column of data is
unique except when it isn't? <grin>

Phone number mught be a good one, particularly if you force a uniform format
on the phone numbers entered. Address is what you mentioned but nailing down
dupe addresses isn't necessarily straightforward. Unless you are running
your addresses through an address correction/validating scrub, accurately
finding all of the duplicate street addresses can get complicated.

Managing the address data for a couple bulk mail campaigns makes one cynical
regarding their fellow man's ability to accurately and/or consistently enter
a postal address into a form; maybe I'm a little prejudiced on this subject.
However, you might need to concatenate two or more columns into a single
entity to get a psuedo primary key (for querying purposes), to double-check
things. For example, if first and last names are two columns - you might
use FullName:[firstname] & " " & [lastname] in the query designer or '" &
[firstname] & ' ' & [lastname] & "' AS FullName in SQL.

Whatever you use, you then can use the results to flag the dupes for review
in one way or another, or bring up the dupe records to look at them.

The traditional dupe finder query looks be something like this:

SELECT * FROM mytable WHERE address=IN(SELECT address FROM mytable AS tmp
GROUP BY address HAVING Count(*)>1;);

This is a tricky little query to understand, from a beginner to intermediate
point of view. I didn't come up with it on my own - Access 95/97 used to
have a find duplicate records wizard. Deduping a set of records is such a
common chore that I soon had that criteria phrase memorized.

If you are using the query designer - you would put

=IN(SELECT address FROM mytable AS tmp GROUP BY address HAVING Count(*)>1;)

in the criteria box of the address column.

For those who don't quite get it (how this works) don't feel badly. I
struggled wth this one. Just had a mental block at first - I saw that it
worked, I memorized it character by character and used it. Later, when I
started using an occasional aggregating query and got stuck doing some
elaborate reports, I began to understand GROUP BY and then 'I got it'.

A quick overview for the slow learners like me.

Whatever you put inside the IN( ) (and we can have more than one item
separated by commas) will be compared to the left-hand side and return true
or false if we get a match or not.

5= IN(1,2,3) will be false, no match
5= IN(3,4,5) will be true, we have a match

To find our duplicate records we build a query inside the IN function that
will return only those values that occur more than once in the table.

In plain English what our query is saying, give us all the addresses (SELECT
address AS tmp) where we get more than one item (HAVING Count(*)>1) that is
the same when we group them together (GROUP BY).

This query inside the IN - returns a list similar to our IN example above.
When the query is running, each row's address column is being compared using
the IN function to an array of dupes only addresses (dupeaddress1,
dupeaddress2,dupeaddress3, etc)

Hope this helps...

Gordon
 
P

Peter Hallett

Thank you, Gordon, and everyone else who so kindly responded. As it happens,
however, the solution, as I suspected, was staring me in the face. Having
not quite got the answer I wanted, I spent another couple of hours with
Microsoft Access Help (never the most helpful of services imho!). Finally,
and not for the first time, I fell over what I was looking for in the
response to what appeared to be a barely relevant question. (I was trying
anything that appeared even vaguely related at that stage.)

For the benefit of anyone who, like me, has been looking at the thing for
years, without seeing it, just go to the database window, select Queries,
click New and there, in the list of options is Find Duplicates query. How on
earth I missed it is a mystery but I have given it a whirl and it works just
fine, faithfully listing the duplicated records in the required table. My
only satisfactions are that at least I did not have to wait for someone else
to point out that I was being dumb and I now have a solution where it was
becoming increasingly apparent that it was going to be very difficult to find
a suitable work-around.

--
Peter Hallett


gllincoln said:
Hi Peter,

The first order of business would be to determine what column of data is
unique except when it isn't? <grin>

Phone number mught be a good one, particularly if you force a uniform format
on the phone numbers entered. Address is what you mentioned but nailing down
dupe addresses isn't necessarily straightforward. Unless you are running
your addresses through an address correction/validating scrub, accurately
finding all of the duplicate street addresses can get complicated.

Managing the address data for a couple bulk mail campaigns makes one cynical
regarding their fellow man's ability to accurately and/or consistently enter
a postal address into a form; maybe I'm a little prejudiced on this subject.
However, you might need to concatenate two or more columns into a single
entity to get a psuedo primary key (for querying purposes), to double-check
things. For example, if first and last names are two columns - you might
use FullName:[firstname] & " " & [lastname] in the query designer or '" &
[firstname] & ' ' & [lastname] & "' AS FullName in SQL.

Whatever you use, you then can use the results to flag the dupes for review
in one way or another, or bring up the dupe records to look at them.

The traditional dupe finder query looks be something like this:

SELECT * FROM mytable WHERE address=IN(SELECT address FROM mytable AS tmp
GROUP BY address HAVING Count(*)>1;);

This is a tricky little query to understand, from a beginner to intermediate
point of view. I didn't come up with it on my own - Access 95/97 used to
have a find duplicate records wizard. Deduping a set of records is such a
common chore that I soon had that criteria phrase memorized.

If you are using the query designer - you would put

=IN(SELECT address FROM mytable AS tmp GROUP BY address HAVING Count(*)>1;)

in the criteria box of the address column.

For those who don't quite get it (how this works) don't feel badly. I
struggled wth this one. Just had a mental block at first - I saw that it
worked, I memorized it character by character and used it. Later, when I
started using an occasional aggregating query and got stuck doing some
elaborate reports, I began to understand GROUP BY and then 'I got it'.

A quick overview for the slow learners like me.

Whatever you put inside the IN( ) (and we can have more than one item
separated by commas) will be compared to the left-hand side and return true
or false if we get a match or not.

5= IN(1,2,3) will be false, no match
5= IN(3,4,5) will be true, we have a match

To find our duplicate records we build a query inside the IN function that
will return only those values that occur more than once in the table.

In plain English what our query is saying, give us all the addresses (SELECT
address AS tmp) where we get more than one item (HAVING Count(*)>1) that is
the same when we group them together (GROUP BY).

This query inside the IN - returns a list similar to our IN example above.
When the query is running, each row's address column is being compared using
the IN function to an array of dupes only addresses (dupeaddress1,
dupeaddress2,dupeaddress3, etc)

Hope this helps...

Gordon




















Peter Hallett said:
I have a dynamically created and amended table which occasionally contains
duplicate records. These duplicates must be allowed but need to be
flagged-up, when they occur, and marked in the table. Setting the No
Duplicates property is clearly inappropriate.

I can think of one or two ways of going about the job, including, perhaps
running two nearly identical queries, one with the unique records property
set and the other without, then comparing the results but the details of
subsequently identifying and marking the appropriate entries have started
to
suggest some rather prolix VBA and I have a suspicion that I have missed
something obvious or, at least, simpler. Has anyone got any thoughts?
 
P

Peter Hallett

Gordon,

As a postscript to my earlier reply, examination of the SQL statement you so
kindly analysed for me shows that it is indeed that used in the Find
Duplicate query.

I am very grateful for your explanation. It certainly clarifies what seems
a pretty arcane piece of code to someone who tries to avoid writing SQL
wherever possible. One day I must get around to learning it but, in the mean
time, I try to use VBA in what I term ‘high-level mode’. It is amazing what
can be achieved without ever defining a record set or writing SQL statements.
Indeed, the duplicate records problem represented one of the few occasions
on which I have been truly stumped over the last few years – and I have had
to deal with some pretty complex situations during that time. Discovering
the Find Duplicate query, however, has allowed me to plug another gap in my
knowledge and further enhanced my respect for Access. In the personal and
small business environment there is certainly very little that it can’t do.

It was clearly not a case of ‘two minds with but a single thought’ – mine
was just stubborn, and rather frustrating, slog – but we seem to have arrived
at the same solution by different routes. Meanwhile, I have had the chance
to study your response in detail and I have filed a printed copy for later
reference, together with the relevant sections of the other replies. Thank
you all, once again.

--
Peter Hallett


gllincoln said:
Hi Peter,

The first order of business would be to determine what column of data is
unique except when it isn't? <grin>

Phone number mught be a good one, particularly if you force a uniform format
on the phone numbers entered. Address is what you mentioned but nailing down
dupe addresses isn't necessarily straightforward. Unless you are running
your addresses through an address correction/validating scrub, accurately
finding all of the duplicate street addresses can get complicated.

Managing the address data for a couple bulk mail campaigns makes one cynical
regarding their fellow man's ability to accurately and/or consistently enter
a postal address into a form; maybe I'm a little prejudiced on this subject.
However, you might need to concatenate two or more columns into a single
entity to get a psuedo primary key (for querying purposes), to double-check
things. For example, if first and last names are two columns - you might
use FullName:[firstname] & " " & [lastname] in the query designer or '" &
[firstname] & ' ' & [lastname] & "' AS FullName in SQL.

Whatever you use, you then can use the results to flag the dupes for review
in one way or another, or bring up the dupe records to look at them.

The traditional dupe finder query looks be something like this:

SELECT * FROM mytable WHERE address=IN(SELECT address FROM mytable AS tmp
GROUP BY address HAVING Count(*)>1;);

This is a tricky little query to understand, from a beginner to intermediate
point of view. I didn't come up with it on my own - Access 95/97 used to
have a find duplicate records wizard. Deduping a set of records is such a
common chore that I soon had that criteria phrase memorized.

If you are using the query designer - you would put

=IN(SELECT address FROM mytable AS tmp GROUP BY address HAVING Count(*)>1;)

in the criteria box of the address column.

For those who don't quite get it (how this works) don't feel badly. I
struggled wth this one. Just had a mental block at first - I saw that it
worked, I memorized it character by character and used it. Later, when I
started using an occasional aggregating query and got stuck doing some
elaborate reports, I began to understand GROUP BY and then 'I got it'.

A quick overview for the slow learners like me.

Whatever you put inside the IN( ) (and we can have more than one item
separated by commas) will be compared to the left-hand side and return true
or false if we get a match or not.

5= IN(1,2,3) will be false, no match
5= IN(3,4,5) will be true, we have a match

To find our duplicate records we build a query inside the IN function that
will return only those values that occur more than once in the table.

In plain English what our query is saying, give us all the addresses (SELECT
address AS tmp) where we get more than one item (HAVING Count(*)>1) that is
the same when we group them together (GROUP BY).

This query inside the IN - returns a list similar to our IN example above.
When the query is running, each row's address column is being compared using
the IN function to an array of dupes only addresses (dupeaddress1,
dupeaddress2,dupeaddress3, etc)

Hope this helps...

Gordon




















Peter Hallett said:
I have a dynamically created and amended table which occasionally contains
duplicate records. These duplicates must be allowed but need to be
flagged-up, when they occur, and marked in the table. Setting the No
Duplicates property is clearly inappropriate.

I can think of one or two ways of going about the job, including, perhaps
running two nearly identical queries, one with the unique records property
set and the other without, then comparing the results but the details of
subsequently identifying and marking the appropriate entries have started
to
suggest some rather prolix VBA and I have a suspicion that I have missed
something obvious or, at least, simpler. Has anyone got any thoughts?
 
P

Peter Hallett

Gordon,
As a postscript to my earlier reply, examination of the SQL statement you so
kindly analysed for me shows that it is indeed that used in the Find
Duplicate query. I am very grateful for your explanation. It certainly
clarifies what seems a pretty arcane piece of code to someone, like me, who
tries to avoid writing SQL wherever possible. One day I must get around to
learning it but, in the mean time, I try to use VBA in what I term
‘high-level mode’. It is amazing what can be achieved without ever defining
a record set or writing SQL statements. Indeed, the duplicate records
problem represented one of the few occasions on which I have been truly
stumped over the last few years – and I have had to deal with some pretty
complex situations during that time. Discovering the Find Duplicate query,
however, has allowed me to plug another gap in my knowledge and further
enhanced my respect for Access. In the personal and small business
environment there is certainly very little that it can’t do.

It was clearly not a case of ‘two minds with but a single thought’ – mine
was just stubborn, and rather frustrating, slog – but we seem to have arrived
at the same solution by different routes. Meanwhile, I have had the chance
to study your response in detail and I have filed a printed copy for later
reference, together with the relevant sections of the other replies. Thank
you all, once again.

--
Peter Hallett


gllincoln said:
Hi Peter,

The first order of business would be to determine what column of data is
unique except when it isn't? <grin>

Phone number mught be a good one, particularly if you force a uniform format
on the phone numbers entered. Address is what you mentioned but nailing down
dupe addresses isn't necessarily straightforward. Unless you are running
your addresses through an address correction/validating scrub, accurately
finding all of the duplicate street addresses can get complicated.

Managing the address data for a couple bulk mail campaigns makes one cynical
regarding their fellow man's ability to accurately and/or consistently enter
a postal address into a form; maybe I'm a little prejudiced on this subject.
However, you might need to concatenate two or more columns into a single
entity to get a psuedo primary key (for querying purposes), to double-check
things. For example, if first and last names are two columns - you might
use FullName:[firstname] & " " & [lastname] in the query designer or '" &
[firstname] & ' ' & [lastname] & "' AS FullName in SQL.

Whatever you use, you then can use the results to flag the dupes for review
in one way or another, or bring up the dupe records to look at them.

The traditional dupe finder query looks be something like this:

SELECT * FROM mytable WHERE address=IN(SELECT address FROM mytable AS tmp
GROUP BY address HAVING Count(*)>1;);

This is a tricky little query to understand, from a beginner to intermediate
point of view. I didn't come up with it on my own - Access 95/97 used to
have a find duplicate records wizard. Deduping a set of records is such a
common chore that I soon had that criteria phrase memorized.

If you are using the query designer - you would put

=IN(SELECT address FROM mytable AS tmp GROUP BY address HAVING Count(*)>1;)

in the criteria box of the address column.

For those who don't quite get it (how this works) don't feel badly. I
struggled wth this one. Just had a mental block at first - I saw that it
worked, I memorized it character by character and used it. Later, when I
started using an occasional aggregating query and got stuck doing some
elaborate reports, I began to understand GROUP BY and then 'I got it'.

A quick overview for the slow learners like me.

Whatever you put inside the IN( ) (and we can have more than one item
separated by commas) will be compared to the left-hand side and return true
or false if we get a match or not.

5= IN(1,2,3) will be false, no match
5= IN(3,4,5) will be true, we have a match

To find our duplicate records we build a query inside the IN function that
will return only those values that occur more than once in the table.

In plain English what our query is saying, give us all the addresses (SELECT
address AS tmp) where we get more than one item (HAVING Count(*)>1) that is
the same when we group them together (GROUP BY).

This query inside the IN - returns a list similar to our IN example above.
When the query is running, each row's address column is being compared using
the IN function to an array of dupes only addresses (dupeaddress1,
dupeaddress2,dupeaddress3, etc)

Hope this helps...

Gordon




















Peter Hallett said:
I have a dynamically created and amended table which occasionally contains
duplicate records. These duplicates must be allowed but need to be
flagged-up, when they occur, and marked in the table. Setting the No
Duplicates property is clearly inappropriate.

I can think of one or two ways of going about the job, including, perhaps
running two nearly identical queries, one with the unique records property
set and the other without, then comparing the results but the details of
subsequently identifying and marking the appropriate entries have started
to
suggest some rather prolix VBA and I have a suspicion that I have missed
something obvious or, at least, simpler. Has anyone got any thoughts?
 
J

Jeff Boyce

Peter

I may be reading too much into your description...

It sounds like an "address listed" is your key to determining the payment.

A few thoughts...

First, if your table into which the data is entered (either via data entry
or an append process) has a unique key (take a look at Access HELP on the
Autonumber field), it won't matter if more than one "address" is identical.

Next, if your process is NOT using an address table, the following addresses
may, in fact, be the same location, but Access would NEVER consider them
duplicates without extensive USB (using someone's brain)<g>:

12345 Elm St.

12345 Elm Street

12345 Elm St SW

Lastly, as you and others have responded, you can use the Find Duplicates
query wizard to help you locate exact duplicates (with the caveat from the
previous comment).

Good luck!

--
Regards

Jeff Boyce
www.InformationFutures.net

Microsoft Office/Access MVP


Microsoft IT Academy Program Mentor
http://microsoftitacademy.com/
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top