Remove Duplicate - Count Number of Unique Results

J

Jeff

I used OCR software to read the Names and addresses off ~500k images. For
each image we received ID and Account Number combinations that correspond to
a Name and Address. Within the data set there are several examples of each
ID, Account, Name and Address combination.

We are interested in removing the duplicates and only having the distinct
records. The software is not 100% accurate so the Names and Addresses can be
different from result to result for the same ID/Account.

My thought was to use a majority voting approach. Below is a link to an
example of the results. There are 19 records for the ID=900000023,
Account=123456789 with varied results. Some of the results within this
ID/Account group are duplicates. I was looking to add a field to output a
count of records for ID/Account group (19) and another column to output the
count of the Name and Address result string. I could then use this along
with the score to narrow down the correct result.

https://spreadsheets.google.com/ccc?key=0At39uG1JJzvCdFpfSl9LVEpJYVFTQ1JGeUIxTTd4V0E&hl=en
 
T

Tom van Stiphout

On Thu, 7 Jan 2010 19:14:01 -0800, Jeff

If you already have the correct accountno + name + address in some
table and are simply trying to match the document to the correct
record, perhaps a "fuzzy string matching" algorithm will help. I wrote
a chapter about one such algorithm in this book:
www.manning.com/nielsen/

To do what you suggested, you would get the count like this:
select count(*) from myTable
group by ID, Account

select count(*) from myTable
group by Name1, Street, City, State, Zip

-Tom.
Microsoft Access MVP
 
J

Jeff

Tom

Thanks for the Quick response; this worked
How would I write this so the 2 counts could exists in the same Query?

Thanks
Jeff
 
T

Tom van Stiphout

On Fri, 8 Jan 2010 04:46:02 -0800, Jeff

Use subselects. Here is an example using the Northwind sample
database:
SELECT Customers.ID, Customers.[State/Province], (select
count([State/Province]) from Customers C where C.[State/Province] =
Customers.[State/Province] group by [State/Province]) AS StateCount
FROM Customers;

Note how the subselect's data is joined to the main select (in this
case using a where clause but you could also use a join).

-Tom.
Microsoft Access MVP
 
J

Jeff

Tom

What Norhtwind Exapme DB is this from?

THansk
Jeff

Tom van Stiphout said:
On Fri, 8 Jan 2010 04:46:02 -0800, Jeff

Use subselects. Here is an example using the Northwind sample
database:
SELECT Customers.ID, Customers.[State/Province], (select
count([State/Province]) from Customers C where C.[State/Province] =
Customers.[State/Province] group by [State/Province]) AS StateCount
FROM Customers;

Note how the subselect's data is joined to the main select (in this
case using a where clause but you could also use a join).

-Tom.
Microsoft Access MVP


Tom

Thanks for the Quick response; this worked
How would I write this so the 2 counts could exists in the same Query?

Thanks
Jeff
.
 
T

Tom van Stiphout

On Fri, 8 Jan 2010 10:06:01 -0800, Jeff

I was using Access 2007.

-Tom.
Microsoft Access MVP

Tom

What Norhtwind Exapme DB is this from?

THansk
Jeff

Tom van Stiphout said:
On Fri, 8 Jan 2010 04:46:02 -0800, Jeff

Use subselects. Here is an example using the Northwind sample
database:
SELECT Customers.ID, Customers.[State/Province], (select
count([State/Province]) from Customers C where C.[State/Province] =
Customers.[State/Province] group by [State/Province]) AS StateCount
FROM Customers;

Note how the subselect's data is joined to the main select (in this
case using a where clause but you could also use a join).

-Tom.
Microsoft Access MVP


Tom

Thanks for the Quick response; this worked
How would I write this so the 2 counts could exists in the same Query?

Thanks
Jeff

:

On Thu, 7 Jan 2010 19:14:01 -0800, Jeff

If you already have the correct accountno + name + address in some
table and are simply trying to match the document to the correct
record, perhaps a "fuzzy string matching" algorithm will help. I wrote
a chapter about one such algorithm in this book:
www.manning.com/nielsen/

To do what you suggested, you would get the count like this:
select count(*) from myTable
group by ID, Account

select count(*) from myTable
group by Name1, Street, City, State, Zip

-Tom.
Microsoft Access MVP


I used OCR software to read the Names and addresses off ~500k images. For
each image we received ID and Account Number combinations that correspond to
a Name and Address. Within the data set there are several examples of each
ID, Account, Name and Address combination.

We are interested in removing the duplicates and only having the distinct
records. The software is not 100% accurate so the Names and Addresses can be
different from result to result for the same ID/Account.

My thought was to use a majority voting approach. Below is a link to an
example of the results. There are 19 records for the ID=900000023,
Account=123456789 with varied results. Some of the results within this
ID/Account group are duplicates. I was looking to add a field to output a
count of records for ID/Account group (19) and another column to output the
count of the Name and Address result string. I could then use this along
with the score to narrow down the correct result.

https://spreadsheets.google.com/ccc?key=0At39uG1JJzvCdFpfSl9LVEpJYVFTQ1JGeUIxTTd4V0E&hl=en
.
.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top