Avoiding dupes when merging files

Thread starter: google_groups3

Hi all.

I currently have 2 text files which contain lists of file names. These
text files are updated by my code. What I want to do is be able to
merge these text files discarding the duplicates.

And to make it harder (or not?), my criterion for defining a
duplicate is the left 15 (or so) characters of the file path.
Help, as always, is greatly appreciated!

Thanks
 
Take a look at the Microsoft Text Driver - you can run SQL queries on the
text file. Perhaps you can just query each file checking for dupes?

Or you could load the data into a DataTable (or a hash table type object?),
with the PK set as the filename. If a duplicate shows up, the DataTable
should throw a duplicate PK exception (a ConstraintException), which you
would catch and ignore.

Or lastly... perhaps you should think of a different method of storing the
data? Maybe a database is a better idea than text files?
 
Thanks for the fast reply. I have to use text files, so a database really is
not an option. Any pointers or some sample code on how to use the DataTable?
I like the idea of being able to trap a duplicate PK error.

Bob
 
I like the idea of the PK exception, as it will give an error that I can
trap. I am being forced to use text files, though, for simplicity. Do you
have any sample code for implementing a DataTable/PK exception? This is
new to me!

Bob
 
Here's the example from MSDN:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemdatadatatableclassprimarykeytopic.asp

I've used it a couple of times and it works fine.

Here is what you do in short:

1. Add your columns to a datatable.
2. Add the key column from step 1 into a primary key array.
3. Add the primary key array to the DataTable.PrimaryKey property.
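
The three steps above can be sketched roughly as follows. This is only an illustration, not the MSDN sample: the module name, the LoadUnique function and the "FileName" column are made up for the example, and it assumes the whole file name is the key.

```vb
Imports System
Imports System.Data

Module DataTableDedupe

    ' Hypothetical helper: loads names into a one-column DataTable whose
    ' primary key is that column, skipping any name that is already present.
    Function LoadUnique(ByVal names As String()) As DataTable
        Dim table As New DataTable("Files")

        ' Step 1: add the column.
        Dim nameColumn As DataColumn = table.Columns.Add("FileName", GetType(String))

        ' Steps 2 and 3: put that column in an array and assign it as the key.
        table.PrimaryKey = New DataColumn() {nameColumn}

        For Each name As String In names
            Try
                table.Rows.Add(New Object() {name})
            Catch ex As ConstraintException
                ' Duplicate key: ignore this entry and carry on.
            End Try
        Next

        Return table
    End Function

    Sub Main()
        Dim table As DataTable = LoadUnique(New String() {"a.txt", "b.txt", "a.txt"})
        For Each row As DataRow In table.Rows
            Console.WriteLine(row("FileName"))
        Next
    End Sub

End Module
```

To match on only the left 15 characters, one option would be to add a second column holding that prefix and use it as the key instead.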
 
Thanks for this. But I guess I need something a little more basic. I am also
unsure whether to do it in memory or straight to disk. I guess I'll keep
playing with the loops.
 
I replied to your message a bit earlier in the day, but I'm not sure if
you received it.

Here's the example from MSDN (particularly the SetPrimaryKeys Sub):

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemdatadatatableclassprimarykeytopic.asp

I've used it a couple of times and it works fine.

Here is what you do in short:

1. Add your columns to a datatable.
2. Add the key column from step 1 into a primary key array.
3. Add the primary key array to the DataTable.PrimaryKey property.
 
OK. This is the solution I came up with. Not as elegant as one would have
hoped, but then again, only I get to see how it functions under the bonnet
(hood for the Americans)! And of course, this is still to be tidied up
and made pretty. Feel free to pull it apart and embarrass me.


' Requires "Imports System.IO" at the top of the file.
' Writes to OutputFile every line of File2Compare that is not in OriginalFile.
Sub FindDupes(ByVal File2Compare As String, ByVal OriginalFile As String, _
              ByVal OutputFile As String)

    Dim File1Reader As New StreamReader(File2Compare)
    Dim File3Writer As New StreamWriter(OutputFile)
    Dim Line1 As String = File1Reader.ReadLine()

    Do While Not Line1 Is Nothing

        ' Rescan the original file from the top for each candidate line.
        Dim File2Reader As New StreamReader(OriginalFile)
        Dim Found As Boolean = False
        Dim Line2 As String = File2Reader.ReadLine()

        ' Test for end of file before comparing: in VB, "" = Nothing is True,
        ' so a blank Line1 would otherwise match the end of the file.
        Do While Not Line2 Is Nothing
            If Line1 = Line2 Then
                Found = True
                Exit Do
            End If
            Line2 = File2Reader.ReadLine()
        Loop

        ' Close once per pass; there is nothing left to close after the loop.
        File2Reader.Close()

        If Not Found Then
            File3Writer.WriteLine(Line1)
        End If

        Line1 = File1Reader.ReadLine()
    Loop

    File1Reader.Close()
    File3Writer.Close()
End Sub
 
As Cor suggested, use a Hashtable (or you might call it a Dictionary). It will
be much more efficient, and easier to code.

Paste the following into a routine to see it in action:

HTH
LFS


Dim item As String
Dim hash As New System.Collections.Hashtable
Dim file1 As String() = New String() { _
    "Pretend this is text from a file.", _
    "It is contained in an array only for", _
    "demo purposes."}
Dim file2 As String() = New String() { _
    "This is the text from a second file.", _
    "The next line is a duplicate line and", _
    "will overwrite the original entry:", _
    "It is contained (DUPLICATE)", _
    "Only the first 10 characters", _
    "were used toward duplicate testing."}

' The key is the first 10 characters of each line; the value is the whole line.
' Note: Substring(0, 10) throws if a line is shorter than 10 characters.
For Each item In file1
    hash.Item(item.Substring(0, 10)) = item
Next

' Assigning to an existing key replaces the stored line instead of adding one.
For Each item In file2
    hash.Item(item.Substring(0, 10)) = item
Next

Dim entry As System.Collections.DictionaryEntry
For Each entry In hash
    Debug.WriteLine(entry.Value)
Next

Debug.WriteLine("")
Debug.WriteLine("Note that the order is not maintained, and")
Debug.WriteLine("the duplicate line's original value was")
Debug.WriteLine("overwritten by the later (duplicate) entry.")
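
If the original order matters and the first occurrence should win, a variation along the following lines might do. It is only a sketch under assumptions: the module and routine names (PrefixMerge, PrefixKey, AddUnique, MergeLines, MergeListFiles) are invented for the example, it uses the generic HashSet and List collections rather than Hashtable, and it compares keys case-sensitively.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO

Module PrefixMerge

    ' Returns the part of a line that decides whether it is a duplicate.
    ' Lines shorter than keyLength are used whole, so Substring cannot throw.
    Function PrefixKey(ByVal line As String, ByVal keyLength As Integer) As String
        If line.Length <= keyLength Then Return line
        Return line.Substring(0, keyLength)
    End Function

    ' Appends each line whose key has not been seen yet, in input order.
    Sub AddUnique(ByVal lines As IEnumerable(Of String), ByVal keyLength As Integer, _
                  ByVal seen As HashSet(Of String), ByVal merged As List(Of String))
        For Each line As String In lines
            ' HashSet.Add returns False when the key is already present.
            If seen.Add(PrefixKey(line, keyLength)) Then merged.Add(line)
        Next
    End Sub

    ' Merges two lists of lines, keeping the first line seen for each key.
    Function MergeLines(ByVal first As IEnumerable(Of String), _
                        ByVal second As IEnumerable(Of String), _
                        ByVal keyLength As Integer) As List(Of String)
        Dim seen As New HashSet(Of String)
        Dim merged As New List(Of String)
        AddUnique(first, keyLength, seen, merged)
        AddUnique(second, keyLength, seen, merged)
        Return merged
    End Function

    ' Hypothetical file-based wrapper: reads both list files into memory
    ' and writes the merged list to outputPath.
    Sub MergeListFiles(ByVal firstPath As String, ByVal secondPath As String, _
                       ByVal outputPath As String, ByVal keyLength As Integer)
        Dim merged As List(Of String) = MergeLines(File.ReadAllLines(firstPath), _
                                                   File.ReadAllLines(secondPath), keyLength)
        File.WriteAllLines(outputPath, merged.ToArray())
    End Sub

    Sub Main()
        ' Made-up paths: the two report files share their first 15 characters.
        Dim file1 As String() = {"C:\data\report_2005.txt", "short"}
        Dim file2 As String() = {"C:\data\report_2004.txt", "other"}
        For Each line As String In MergeLines(file1, file2, 15)
            Console.WriteLine(line)
        Next
    End Sub

End Module
```

Since Windows paths are not case-sensitive, passing StringComparer.OrdinalIgnoreCase to the HashSet constructor may be closer to what is wanted.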
 
P.S. Yes I know that half the code is missing. It was late when I posted
this. I will update it with the missing parts this weekend.
 
Anon-E-Moose said:
Very inefficient when compared to Cor's elegant example of a hash table!

I thought hashing would work because when the hash is calculated for 2
identical strings, the hashes would be the same. So it was just a case of
comparing hashes. But someone else told me that this is not the case
because hashes are generated and not calculated, so the hashes would be
different. Is this not so?
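
For what it is worth, a quick check along these lines suggests the first understanding is the right one: within one run of a program, a string's hash code is calculated from its characters, so two equal strings give the same value. The module and function names here are made up for the test.

```vb
Imports System
Imports System.Collections

Module HashDemo

    ' True when two strings produce the same hash code.
    Function SameHash(ByVal a As String, ByVal b As String) As Boolean
        Return a.GetHashCode() = b.GetHashCode()
    End Function

    Sub Main()
        ' Build the second string at run time so it is a separate object
        ' that merely contains the same characters.
        Dim first As String = "C:\data\file.txt"
        Dim second As String = String.Concat("C:\data\", "file.txt")

        Console.WriteLine(SameHash(first, second))

        ' A Hashtable finds the entry through the equal key, not the same object.
        Dim table As New Hashtable
        table(first) = "from file 1"
        Console.WriteLine(table.ContainsKey(second))
    End Sub

End Module
```

The reverse does not hold: two different strings can share a hash code, which is why Hashtable also compares the keys themselves before treating them as the same entry.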
 
OK, you got me. I have not been thinking straight. I have since given it
further thought and the hash table is clearly the better way to go,
especially because of the file sizes I eventually will be using. So, I am
currently writing it all now (or just customising the samples placed
here? ;-) )

Thanks for the code samples, Cor. As always, you have been helpful.
 
