Avoiding dupes when merging files

google_groups3 · Nov 24, 2004

Hi all.

I currently have 2 text files which contain lists of file names. These
text files are updated by my code. What I want to do is be able to
merge these text files discarding the duplicates.

And to make it harder (or not?!), my criterion for defining a
duplicate is the left 15 (or so) characters of the file path.
Help, as always, is greatly appreciated!

Thanks

Lucas Tam · Nov 24, 2004

Take a look at the Microsoft Text Driver - you can run SQL queries on the
text file. Perhaps you can just query each file checking for dupes?

Or you could load the data into a datatable (or hash table type object?),
with the PK set as the filename... if a duplicate shows up, the datatable
should throw a duplicate PK exception which you would catch and ignore.

Or lastly... perhaps you should think of a different method of storing the
data? Maybe a database is a better idea than text files?

Bob Hollness · Nov 24, 2004

Thanks for the fast reply. I have to use text files, so a database really is
not an option. Any pointers or some sample code on how to use the datatable? I
like the idea of being able to trap a duplicate PK error.

Bob

Bob Hollness · Nov 25, 2004

I like the idea of the PK exception, as it will give an error that I can
trap. I am being forced to use text files, though, for simplicity. Do you
have any sample code for implementing a datatable/PK exception? This is
new to me!

Bob

Lucas Tam · Nov 25, 2004

Here's the example from MSDN:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemdatadatatableclassprimarykeytopic.asp

I've used it a couple of times and it works fine.

Here is what you do in short:

1. Add your columns to a datatable.
2. Add the same column from step 1 into a primary key array.
3. Add the primary key array to the DataTable.PrimaryKey property.
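
The three steps above can be sketched roughly as follows. This is only an illustration, not the MSDN sample: the module name, the "Key" and "Path" column names, and the TryAddPath helper are all made up for the example, and the key is assumed to be the first keyLength characters of each path. Adding a row whose key already exists should raise a ConstraintException, which the helper catches and reports as False.

```vb
Imports System
Imports System.Data

Module PrimaryKeyDemo
    ' Steps 1-3: add the columns, put the key column in an array,
    ' and assign that array to DataTable.PrimaryKey.
    ' ("Key" and "Path" are hypothetical column names.)
    Function BuildTable() As DataTable
        Dim table As New DataTable("Files")
        Dim keyColumn As DataColumn = table.Columns.Add("Key", GetType(String))
        table.Columns.Add("Path", GetType(String))
        table.PrimaryKey = New DataColumn() {keyColumn}
        Return table
    End Function

    ' Tries to add a path. Returns False when a row with the same
    ' key (the first keyLength characters) is already in the table.
    Function TryAddPath(ByVal table As DataTable, ByVal path As String, _
                        ByVal keyLength As Integer) As Boolean
        Dim key As String = path
        If key.Length > keyLength Then key = key.Substring(0, keyLength)
        Try
            table.Rows.Add(New Object() {key, path})
            Return True
        Catch ex As ConstraintException
            ' Duplicate primary key: ignore the row.
            Return False
        End Try
    End Function
End Module
```

Calling TryAddPath once per line of each text file, and then writing out the "Path" column of every row, would give the merged list without duplicates.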

Bob Hollness · Nov 25, 2004

Thanks for this, but I guess I need something a little more basic. I am also
not sure whether to do it in memory or straight to disk. I guess I'll keep
playing with the loops.

Lucas Tam · Nov 25, 2004

I replied to your message a bit earlier in the day, but I'm not sure if
you received it:

Here's the example from MSDN (particularly the SetPrimaryKeys Sub):

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemdatadatatableclassprimarykeytopic.asp

I've used it a couple of times and it works fine.

Here is what you do in short:

1. Add your columns to a datatable.
2. Add the same column from step 1 into a primary key array.
3. Add the primary key array to the DataTable.PrimaryKey property.

Bob Hollness · Nov 25, 2004

OK, this is the solution I came up with. It is not as elegant as I had
hoped, but then again, only I get to see how it functions under the bonnet
(hood for the Americans)! It still needs to be tidied up and made pretty.
Feel free to pull it apart and embarrass me...

Imports System.IO   ' at the top of the file

' Writes to OutputFile every line of File2Compare that does not
' already appear in OriginalFile (whole-line comparison).
Sub FindDupes(ByVal File2Compare As String, ByVal OriginalFile As String, _
              ByVal OutputFile As String)

    Dim File1Reader As New StreamReader(File2Compare)
    Dim File2Reader As StreamReader
    Dim File3Writer As New StreamWriter(OutputFile)
    Dim Line1 As String
    Dim Line2 As String
    Dim Found As Boolean

    Do
        Line1 = File1Reader.ReadLine()

        If Not Line1 Is Nothing Then
            Found = False

            ' Rescan the original file from the top for each line.
            File2Reader = New StreamReader(OriginalFile)
            Do
                Line2 = File2Reader.ReadLine()
                ' Test for Nothing first: in VB, "" = Nothing is True.
                If Not Line2 Is Nothing AndAlso Line1 = Line2 Then
                    Found = True
                    Exit Do
                End If
            Loop Until Line2 Is Nothing
            File2Reader.Close()

            If Not Found Then
                File3Writer.WriteLine(Line1)
            End If
        End If
    Loop Until Line1 Is Nothing

    File1Reader.Close()
    File3Writer.Close()
End Sub

Anon-E-Moose · Nov 26, 2004

Very inefficient when compared to Cor's elegant example of a hash table!

Larry Serflaten · Nov 26, 2004

Bob Hollness said:
OK. This is the solution I came up with. Not as elegant as one would have
hoped. but then again, only I get to see how it functions under the bonnet
(hood for the Americans) !!! And of course, this is still to be tidied up
and made pretty. Feel free to pull it apart and embarrass me.......

As Cor suggested, use a Hashtable (or you might call it a Dictionary). It will
be much more efficient and easier to code.

Paste the following into a routine to see it in action:

    Dim item As String
    Dim hash As New System.Collections.Hashtable
    Dim file1 As String() = New String() { _
        "Pretend this is text from a file.", _
        "It is contained in an array only for", _
        "demo purposes."}
    Dim file2 As String() = New String() { _
        "This is the text from a second file.", _
        "The next line is a duplicate line and", _
        "will overwrite the original entry:", _
        "It is contained (DUPLICATE)", _
        "Only the first 10 characters", _
        "were used toward duplicate testing."}

    ' Key each line by its first 10 characters. Assigning through
    ' Item adds a new key or overwrites the value of an existing one.
    For Each item In file1
        hash.Item(item.Substring(0, 10)) = item
    Next

    For Each item In file2
        hash.Item(item.Substring(0, 10)) = item
    Next

    ' List whatever survived the merge.
    Dim entry As System.Collections.DictionaryEntry
    For Each entry In hash
        Debug.WriteLine(entry.Value)
    Next

    Debug.WriteLine("")
    Debug.WriteLine("Note that the order is not maintained, and")
    Debug.WriteLine("the duplicate line's original value was")
    Debug.WriteLine("overwritten by the later (duplicate) entry.")

HTH
LFS
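
A possible variation on the demo above, offered only as a sketch: it keeps the first occurrence of each key instead of the last, preserves the original line order by collecting results in an ArrayList, and avoids the exception that Substring(0, n) raises for lines shorter than n characters. The names MergeDemo, DupeKey, AddUnique and MergeLines are invented for this example, and the lines are assumed to have been read from the two files already.

```vb
Imports System
Imports System.Collections

Module MergeDemo
    ' Key used for duplicate detection: the first keyLength characters,
    ' or the whole line when it is shorter than that.
    Function DupeKey(ByVal line As String, ByVal keyLength As Integer) As String
        If line.Length <= keyLength Then Return line
        Return line.Substring(0, keyLength)
    End Function

    ' Appends to result every line whose key has not been seen yet.
    Sub AddUnique(ByVal lines As String(), ByVal keyLength As Integer, _
                  ByVal seen As Hashtable, ByVal result As ArrayList)
        Dim line As String
        For Each line In lines
            Dim key As String = DupeKey(line, keyLength)
            If Not seen.ContainsKey(key) Then
                seen.Add(key, line)
                result.Add(line)
            End If
        Next
    End Sub

    ' Merges two lists of lines; the first occurrence of a key wins.
    Function MergeLines(ByVal first As String(), ByVal second As String(), _
                        ByVal keyLength As Integer) As ArrayList
        Dim seen As New Hashtable
        Dim result As New ArrayList
        AddUnique(first, keyLength, seen, result)
        AddUnique(second, keyLength, seen, result)
        Return result
    End Function
End Module
```

With a key length of 15, as in the original question, each file is read only once, so the work grows with the total number of lines rather than with the product of the two file sizes.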

Bob Hollness · Nov 26, 2004

P.S. Yes I know that half the code is missing. It was late when I posted
this. I will update it with the missing parts this weekend.

Bob Hollness · Nov 26, 2004

Anon-E-Moose said:
Very inefficient when compared to Cor's elegant example of a hash table!

I thought hashing would work because when the hash is calculated for 2
identical strings, the hashes would be the same. So it was just a case of
comparing hashes. But someone else told me that this is not the case
because hashes are generated and not calculated, so the hashes would be
different. Is this not so?
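
A small check that may help with this question, given as a sketch with made-up names (HashDemo, Rebuild, HashesMatch, FoundByValue). In .NET a string's hash code is computed from its characters, so two separate String objects with the same text give the same hash code within one run of a program; the numeric values should not be relied on across runs or framework versions. A Hashtable also compares keys with Equals after matching hash codes, so a lookup by an equal string finds the entry.

```vb
Imports System
Imports System.Collections

Module HashDemo
    ' Builds a new String object holding the same characters as text.
    Function Rebuild(ByVal text As String) As String
        Return New String(text.ToCharArray())
    End Function

    ' True when the original and a rebuilt copy hash to the same value.
    Function HashesMatch(ByVal text As String) As Boolean
        Dim copy As String = Rebuild(text)
        Return text.GetHashCode() = copy.GetHashCode()
    End Function

    ' True when a Hashtable keyed by text is matched by an equal copy.
    Function FoundByValue(ByVal text As String) As Boolean
        Dim table As New Hashtable
        table.Add(text, Nothing)
        Return table.ContainsKey(Rebuild(text))
    End Function
End Module
```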

Bob Hollness · Nov 27, 2004

OK, you got me. I have not been thinking straight. I have since given it
further thought, and the hash table is clearly the better way to go,
especially because of the file sizes I will eventually be using. So I am
writing it all now (or just customising the samples placed
here...? ;-) )

Thanks for the code samples, Cor. As always, you have been helpful.
