Managing duplicates when INSERTing.

GPO

My instructions are: use ADO and Access 2000, and avoid DoCmd.

I have several thousand text files (aka extracts) of several thousand rows
each that I want to insert into a table. Most of the rows in the extracts are
unwanted duplicates, in the sense that what I define as the primary key in the
table can appear several times over the course of several extracts. I will
probably have to sift through about 6 million rows of text to arrive at a
final table of about 1.4 million "unique" rows.

The extracts are named in such a way that if I load them in descending order
of file name, the most recent records (the ones I want) load first. I want to
set things up so that subsequent records simply don't load if they duplicate
the keys of what is already loaded. I thought I had seen this functionality
somewhere in Access (TransferText?), where a paste errors table is created and
the non-duplicated records load successfully. Instead I get the "The changes
you requested to the table were not successful because they would create
duplicate values..." message (and, well, thanks, but I know that).

For example:

Extract name: "A1.txt"
KeyPart1 KeyPart2 Field1 Field2 ...
...
1234 456 ABC DEF ...
...

Extract name: "A2.txt"
KeyPart1 KeyPart2 Field1 Field2 ...
...
1234 456 ABE DEG ...
...

In the above example the row in A2.txt would be loaded first and is the
desired row.

Given the volume of the data, I obviously want to load it with the fewest
possible passes through the data. Is there a way to load only the
non-duplicated records in each extract (hopefully in a single pass, without
nested subqueries)?
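(To illustrate the kind of single-pass statement I'm imagining, with made-up
table names, and assuming a text file can be joined to a Jet table via the
[Text;DATABASE=...] connect-string form, something like an unmatched-keys
append:)

INSERT INTO tblFinal ( KeyPart1, KeyPart2, Field1, Field2 )
SELECT t.KeyPart1, t.KeyPart2, t.Field1, t.Field2
FROM [Text;DATABASE=C:\Extracts;].[A2#txt] AS t
LEFT JOIN tblFinal AS f
ON (t.KeyPart1 = f.KeyPart1) AND (t.KeyPart2 = f.KeyPart2)
WHERE f.KeyPart1 Is Null;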

Here is a hacked-up aircode version of the import function called while
looping through all the extracts (apologies for any wrapping):
Function ImportExtract( _
    strExtractName As String, _
    strLocalStore As String, _
    strDestTable As String) As Boolean

    Dim cmd As ADODB.Command
    Dim strTextFileNameModified As String

    On Error GoTo ErrorHandler

    '1. Modify the file name by replacing the dot (.) with a "#",
    '   e.g. "A1.txt" becomes "A1#txt".
    strTextFileNameModified = Replace(strExtractName, ".", "#", 1, 1, vbTextCompare)

    '2. Create the command object.
    Set cmd = New ADODB.Command
    With cmd
        .ActiveConnection = CurrentProject.Connection
        .CommandType = adCmdText
        '3. Use SQL as the command text.
        '   The SQL references the text file directly (needs schema.ini).
        .CommandText = "INSERT INTO " & strDestTable & " ( KeyPart1, KeyPart2 [etc] ) " & _
            "SELECT KeyPart1, KeyPart2 [etc] " & _
            "FROM " & strTextFileNameModified & _
            " IN """" [Text; DATABASE=" & strLocalStore & ";]"
        '4. Run the SQL.
        .Execute
    End With

    ImportExtract = True

ExitHandler:
    Set cmd = Nothing
    Exit Function

ErrorHandler:
    ImportExtract = False
    GoTo ExitHandler
End Function

Many thanks, and sorry for the length of the post.
 
Tom Ellison

Dear GPO:

I recommend a totally query-based solution.

It is my policy always to have a "raw data" table into which every proposed
row is inserted first. In this table I always have a "Source" column which
identifies where the data came from. In your situation this would be the name
of the text file, on which you can sort descending so that the desired row
shows up first in each set, sorting first by the columns that make up your
unique key.

Assuming that you never have a duplicate key within any one text file, you
can then write a "totals query" grouped by all your key values, giving the
MAX() of the Source column. By making an INNER JOIN from that query back to
the table of all the source rows, you can then eliminate all duplicates
(again assuming you do not have any duplicates within the same source text
file).
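
Roughly, in air-SQL (untested, with your column names and a couple of
invented names, tblRaw for the raw table and tblFinal for the destination):

SELECT KeyPart1, KeyPart2, Max(Source) AS LatestSource
FROM tblRaw
GROUP BY KeyPart1, KeyPart2;

Save that as, say, qryLatest, then append from the join:

INSERT INTO tblFinal ( KeyPart1, KeyPart2, Field1, Field2 )
SELECT r.KeyPart1, r.KeyPart2, r.Field1, r.Field2
FROM tblRaw AS r INNER JOIN qryLatest AS q
ON (r.KeyPart1 = q.KeyPart1)
AND (r.KeyPart2 = q.KeyPart2)
AND (r.Source = q.LatestSource);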

Does that sound like a way to do it?

Tom Ellison
Microsoft Access MVP
Ellison Enterprises - Your One Stop IT Experts


GPO

Thanks, Tom, for your very quick response.

Yes, I've toyed with precisely this before, and agree that it does work for
moderate-sized chunks of data. I have a couple of observations though:
1. Are you suggesting that I will end up with a "raw" table AND a "final"
table, or a "raw" table and a "final" query?
2. If just a raw table, could I get around the speed issues of nested queries
by having a "void" flag in each row of the "raw" table, set to Yes for each
unwanted row? If so, is it possible to write an update query to do this
(noting point 3)? Something like the update sketched just after this list is
what I have in mind.
3. One problem I've come across with the INNER JOIN between the query and the
raw table is the "this recordset is not updateable" message when attempting
to delete from or update what would be the raw table. Am I right in thinking
that this only occurs when tables/queries don't have PKs defined?
4. Because of the volume of data, I fear I may approach the 2 GB limit of the
db (one reason why I was hoping to stop the data at the point of insertion,
especially given that 75% of the "raw" data is unwanted).
5. Interestingly, I did a similar exercise using ADO: looping through every
row of a correctly ordered recordset, comparing each row with the previous
row, flagging the unwanted rows for deletion, and then running a delete
query. Maybe I had my SQL badly set up, but ADO (or AD-slow as it is known)
was impressively faster than the SQL. Go figure.
6. Although it is a business rule that there are no duplicates within a
single text file, it does happen to about 3 in every 100,000 records. Because
we can point to a business rule that says "there can only be one" (the
highlander rule), we just arbitrarily pick one and ignore the other. This
means, though, importing the data into a "raw" table with a surrogate key
(autonumber), or with no key at all (which brings you back to point three)...
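
(The void-flag update I have in mind for point 2 would be something like the
following; untested, the Void column and table name are invented, and it
leans on a correlated subquery, which is exactly the sort of thing I was
hoping to avoid for speed:)

UPDATE tblRaw
SET Void = True
WHERE Source < (SELECT Max(R2.Source) FROM tblRaw AS R2
WHERE R2.KeyPart1 = tblRaw.KeyPart1
AND R2.KeyPart2 = tblRaw.KeyPart2);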

Another approach I was thinking about was maybe a bit of script that worked
directly with the text files before importing them (rough aircode follows the
list). It would have to:
1. Open a new text file for writing.
2. Loop through the data files in descending filename order.
3. Within each data file, loop through each line, and if the text string
preceding the second TAB (it's tab delimited, with the key data in the first
two fields) does not already appear as a key in the new file, copy that line
to the new file (concatenated with metadata such as the originating filename).
4. Import just the desired rows from the new text file.
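
Rough, untested aircode for the above (assumes a reference to Microsoft
Scripting Runtime for FileSystemObject and Dictionary; GetExtractNamesDescending
is a helper I'd still have to write, returning the extract file names in
descending order):

Sub BuildDeDupedExtract(strSourceFolder As String, strOutputFile As String)

    Dim fso As Scripting.FileSystemObject
    Dim tsIn As Scripting.TextStream
    Dim tsOut As Scripting.TextStream
    Dim dicKeys As Scripting.Dictionary
    Dim varName As Variant
    Dim strLine As String
    Dim strKey As String
    Dim lngTab2 As Long

    Set fso = New Scripting.FileSystemObject
    Set dicKeys = New Scripting.Dictionary
    Set tsOut = fso.CreateTextFile(strOutputFile, True)

    'Work through the extracts in descending file-name order so the most
    'recent version of each key is seen (and kept) first.
    For Each varName In GetExtractNamesDescending(strSourceFolder)
        Set tsIn = fso.OpenTextFile(strSourceFolder & "\" & varName, ForReading)
        Do While Not tsIn.AtEndOfStream
            strLine = tsIn.ReadLine
            'The key is everything up to the second TAB.
            lngTab2 = InStr(InStr(strLine, vbTab) + 1, strLine, vbTab)
            If lngTab2 > 0 Then   'skip blank or malformed lines
                strKey = Left$(strLine, lngTab2 - 1)
                If Not dicKeys.Exists(strKey) Then
                    dicKeys.Add strKey, vbNullString
                    'Tack the originating file name on as an extra column.
                    tsOut.WriteLine strLine & vbTab & varName
                End If
            End If
        Loop
        tsIn.Close
    Next varName

    tsOut.Close

End Sub

Header rows and the roughly 1.4 million keys sitting in the Dictionary would
both need some thought, but that's the shape of it.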

Has anybody dealt with it this way before? Is it slow? Or maybe even
SLOOOWWW?

Thanks again

GPO


Tom Ellison

See inline, below.

Tom Ellison
Microsoft Access MVP
Ellison Enterprises - Your One Stop IT Experts

> Thanks, Tom, for your very quick response.
>
> Yes, I've toyed with precisely this before, and agree that it does work for
> moderate-sized chunks of data. I have a couple of observations though:
> 1. Are you suggesting that I will end up with a "raw" table AND a "final"
> table, or a "raw" table and a "final" query?

I would definitely plan to have a "final" table.
> 2. If just a raw table, could I get around the speed issues of nested
> queries by having a "void" flag in each row of the "raw" table, set to Yes
> for each unwanted row? If so, is it possible to write an update query to do
> this (noting point 3)?
> 3. One problem I've come across with the INNER JOIN between the query and
> the raw table is the "this recordset is not updateable" message when
> attempting to delete from or update what would be the raw table. Am I right
> in thinking that this only occurs when tables/queries don't have PKs
> defined?

Usually, and I believe specifically in this case, that can be overcome
with the "Allow Inconsistent Updates" option. However, I don't see
why you would ever mess with the "raw" table except to empty it for
the next round of imports.
> 4. Because of the volume of data, I fear I may approach the 2 GB limit of
> the db (one reason why I was hoping to stop the data at the point of
> insertion, especially given that 75% of the "raw" data is unwanted).

Not an insignificant consideration, but making the process much more
difficult to write isn't a good alternative either. I suggest that you could
put the "raw" table in another database completely. The drawback of doing so
is that you can't establish relationships between that table and the others,
but that should be of no concern here anyway.
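
An append query can write to a table in another database directly via the IN
clause, along these lines (untested, path and names invented, and assuming
the columns of tblRaw line up with the SELECT list):

INSERT INTO tblRaw IN 'C:\Staging\RawData.mdb'
SELECT t.KeyPart1, t.KeyPart2, t.Field1, t.Field2, 'A1#txt' AS Source
FROM [Text;DATABASE=C:\Extracts;].[A1#txt] AS t;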
> 5. Interestingly, I did a similar exercise using ADO: looping through every
> row of a correctly ordered recordset, comparing each row with the previous
> row, flagging the unwanted rows for deletion, and then running a delete
> query. Maybe I had my SQL badly set up, but ADO (or AD-slow as it is known)
> was impressively faster than the SQL. Go figure.

That seems backwards to me, as well. The query should be the much
preferred method, for performance and for other reasons.
> 6. Although it is a business rule that there are no duplicates within a
> single text file, it does happen to about 3 in every 100,000 records.
> Because we can point to a business rule that says "there can only be one"
> (the highlander rule), we just arbitrarily pick one and ignore the other.
> This means, though, importing the data into a "raw" table with a surrogate
> key (autonumber), or with no key at all (which brings you back to point
> three)...

You can import without a PK, eliminate duplicates, then establish the
PK before proceeding.
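
For example, if the raw table carries an autonumber column (called ID here
purely for illustration), something along these lines keeps one arbitrary row
per key per source file and then puts the key in place (untested; the NOT IN
will not be quick, but it only has to run once per load, and Source belongs
in the key because the raw table still legitimately holds the same key from
different files):

DELETE FROM tblRaw
WHERE ID NOT IN
(SELECT Min(R2.ID) FROM tblRaw AS R2
GROUP BY R2.KeyPart1, R2.KeyPart2, R2.Source);

ALTER TABLE tblRaw
ADD CONSTRAINT pkRaw PRIMARY KEY (KeyPart1, KeyPart2, Source);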
> Another approach I was thinking about was maybe a bit of script that
> worked directly with the text files before importing them. It would have
> to:
> 1. Open a new text file for writing.
> 2. Loop through the data files in descending filename order.
> 3. Within each data file, loop through each line, and if the text string
> preceding the second TAB (it's tab delimited, with the key data in the
> first two fields) does not already appear as a key in the new file, copy
> that line to the new file (concatenated with metadata such as the
> originating filename).
> 4. Import just the desired rows from the new text file.

This does not sound so good to me.
> Has anybody dealt with it this way before? Is it slow? Or maybe even
> SLOOOWWW?
>
> Thanks again
>
> GPO

