Reading large csv-file and removing duplicates


Robertico

I have a large csv-file (4 GB) with a lot of duplicates. The csv-file contains
two fields.
If there is a duplicate, both fields are duplicated.
So I need to read this large csv-file, remove the duplicates and write a new
csv-file with only the unique values.
I used StreamReader and ReadLine to read the csv-file and Split to split the
fields.

After reading, I added the line to a HashSet:

HashSet<string> DataRows = new HashSet<string>();

I check for duplicates (only the first field, to save memory) before writing to
the new csv-file:

bool isDuplicate = !DataRows.Add(DataColumn[0]); // Add returns false when the value is already present

But every time an exception of type 'System.OutOfMemoryException' was thrown.
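
In outline, what I did looks roughly like this (the file names and the comma
separator are placeholders for my real data):

using System.Collections.Generic;
using System.IO;

class InMemoryDedup
{
    // Sketch of the approach described above; this is the version that runs
    // out of memory on a 4 GB input, because every distinct key is kept in
    // RAM as a UTF-16 string.
    static void Main()
    {
        var seen = new HashSet<string>();

        using (var reader = new StreamReader("input.csv"))
        using (var writer = new StreamWriter("unique.csv"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] columns = line.Split(',');

                // Add returns true only when the key was not seen before;
                // keying on the first field only, as described above.
                if (seen.Add(columns[0]))
                {
                    writer.WriteLine(line);
                }
            }
        }
    }
}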

Any advice is welcome.


Thanks in advance,

Robertico
 

Alberto Poblacion

Robertico said:
I have a large csv-file (4 GB) with a lot of duplicates.
[...] I used StreamReader and ReadLine to read the csv-file [...] But every
time an exception of type 'System.OutOfMemoryException' was thrown.

Reading a 4 GB text file into memory is likely to require 8 GB of memory
(plus overhead for the data structures), since .NET stores strings as
Unicode, requiring 2 bytes per character. In order to address that much
memory, you would need to run a 64-bit version of your program, and of
course make that amount of memory available to it (even if it is virtual
memory).

If you don't have the memory resources, you will need to resort to using
the hard disk as temporary storage to sort the data while you are
searching for duplicates. One simple way to do this would be to import the
data into a database, and then let the database do the work of sorting out
the duplicates for you. Otherwise, if you want to do the work yourself, you
will need to sort the data on disk. There are various ways to do this; for
instance, read about Merge Sort.
 

CY

I have a large csv-file (4 GB) with a lot of duplicates. The csv-file contains
two fields.
If there is a duplicate, both fields are duplicated.
So I need to read this large csv-file, remove the duplicates and write a new
csv-file with only the unique values.
I used StreamReader and ReadLine to read the csv-file and Split to split the
fields
.... and so on

If you can read the whole thing into memory, then why not do a loop and
delete all except one of the occurrences? On the other hand,
if you are lazy it can be done using "any" database: just import it into a
table that uses both fields as a combined key and does not allow duplicates.
A unique constraint on that combined key (as basic normalization would require) would do it.

//CY
 

Robertico

Reading a 4 GB text file into memory is likely to require 8 GB of memory
(plus overhead for the data structures), since .NET stores strings as
Unicode, requiring 2 bytes per character. In order to address that much
memory, you would need to run a 64-bit version of your program, and of
course make that amount of memory available to it (even if it is virtual
memory).

I have only 2 GB of RAM available and my OS is 32-bit.
If you don't have the memory resources, you will need to resort to
using the hard disk as temporary storage to sort the data while you are
searching for duplicates. One simple way to do this would be to import the
data into a database, and then let the database do the work of sorting out
the duplicates for you. ..

I tried to import the data into an MS Access 2007 database, but the file is
too large :))
Otherwise, if you want to do the work yourself, you will need to sort the
data on disk. There are various ways to do this; for instance, read about
Merge Sort.

Can you be clearer about this? It sounds interesting.

Regards,

Robertico.
 

Robertico

...Create, say, ten smaller files by looking at the first
letter of the index entry on each line. Put A - B in one file, C - D
in the next file and so on. Then you can remove duplicates from each
file separately and merge the de-duplicated files again at the end.

The file is not sorted.

Robertico
 

Alberto Poblacion

Robertico said:
I have only 2 GB of RAM available and my OS is 32-bit.
[...]
I tried to import the data into an MS Access 2007 database, but the file is
too large :))

Yes, Access has a file size limitation of 2 GB. If you want to go the
database route, you will need something bigger. Ordinarily, I would
recommend SQL Server Express, but unfortunately the Express version is
limited to 4 GB per database, so it will still not serve you.
[...] Can you be clearer about this? It sounds interesting.

One way in which you could remove duplicates without reading all the
data into memory is this:

Loop over the file on disk, reading the rows one by one. For each row do
the following:
- Compute a hash value on the fields that you want to be unique (the
GetHashCode method should serve you well).
- Use the hash code as the key under which you store, in a hashtable, the
offset in the file where the record was located. This hashtable will
probably be small enough to fit into your available RAM, since it only
stores a numeric value and not your whole records.
- For each record, first look its hash up in the hash table. If it exists
there, Seek to the offset in the file where the earlier record is stored,
read the data and compare it to the record you are currently processing. If
they match, then the current record is a duplicate and it can be discarded.
Otherwise, continue comparing against the rest of the records that have the
same hash code, since there could be more than one. After you have compared
them all, if none match, save the record's hash and offset to the internal
table and write the complete record to an output file.
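
A minimal C# sketch of this idea (it assumes single-byte ASCII text and
"\r\n" line endings for the offset arithmetic, and the file names are
placeholders; adjust all of those to your data):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class OffsetDedup
{
    static void Main()
    {
        // Maps a hash code to the file offsets of the rows that produced it.
        var offsetsByHash = new Dictionary<int, List<long>>();

        using (var reader = new StreamReader("input.csv", Encoding.ASCII))
        using (var probeStream = new FileStream("input.csv", FileMode.Open, FileAccess.Read))
        using (var probeReader = new StreamReader(probeStream, Encoding.ASCII))
        using (var writer = new StreamWriter("unique.csv", false, Encoding.ASCII))
        {
            long offset = 0;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                long lineStart = offset;
                offset += Encoding.ASCII.GetByteCount(line) + 2; // +2 assumes "\r\n"

                // The whole line is the key, since a duplicate matches on both fields.
                int hash = line.GetHashCode();
                bool duplicate = false;

                List<long> candidates;
                if (offsetsByHash.TryGetValue(hash, out candidates))
                {
                    // Same hash code: re-read each earlier candidate from disk and compare.
                    foreach (long candidateOffset in candidates)
                    {
                        probeStream.Seek(candidateOffset, SeekOrigin.Begin);
                        probeReader.DiscardBufferedData();
                        if (probeReader.ReadLine() == line)
                        {
                            duplicate = true;
                            break;
                        }
                    }
                }
                else
                {
                    candidates = new List<long>();
                    offsetsByHash[hash] = candidates;
                }

                if (!duplicate)
                {
                    candidates.Add(lineStart);
                    writer.WriteLine(line);
                }
            }
        }
    }
}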
 

Alberto Poblacion

Robertico said:
The file is not sorted.

Once you have split the file into smaller files, so that each of those
pieces fits in memory, you can easily read a piece into a SortedList, so
that it gets sorted in memory. You then dump it to disk again.
Then you use a Merge Sort on the small files. Basically, you open two
files and read the first record of each one. You compare both records and
write the smaller one to the output file. You then read the next record
from the file that contained the one that you wrote, and repeat the
operation. Of course, in this process you omit the duplicates. This
mechanism can easily be extended to more than two files, if needed.
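
In code, the two-file merge step could look roughly like this (placeholder
file names, and it assumes each chunk was sorted with the same ordinal
comparison used here):

using System;
using System.IO;

class MergeDedup
{
    // Merges two already-sorted files into one sorted, duplicate-free file.
    static void Main()
    {
        using (var left = new StreamReader("chunkA.csv"))
        using (var right = new StreamReader("chunkB.csv"))
        using (var output = new StreamWriter("merged.csv"))
        {
            string l = left.ReadLine();
            string r = right.ReadLine();
            string lastWritten = null;

            while (l != null || r != null)
            {
                string smallest;

                // Take the smaller of the two current records.
                if (r == null || (l != null && StringComparer.Ordinal.Compare(l, r) <= 0))
                {
                    smallest = l;
                    l = left.ReadLine();
                }
                else
                {
                    smallest = r;
                    r = right.ReadLine();
                }

                // In sorted order, duplicates come out consecutively, so it is
                // enough to remember the last record that was written.
                if (smallest != lastWritten)
                {
                    output.WriteLine(smallest);
                    lastWritten = smallest;
                }
            }
        }
    }
}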
 

CY

I tried to import the data into an MS Access 2007 database, but the file is
too large :))

Robertico.

Use NTFS and MySQL (still free? don't know). Remember that a MyISAM
database filled only with text (or blob) data will need to be configured a
bit: the value of myisam_data_pointer_size can be from 2 to 7. A value of 4
allows tables up to 4 GB; a value of 6 allows tables up to 256 TB. MS Access
might work for your private CD/DVD collection, but not for handling
more data.

There should be a number of C# examples for reading/writing the
database out there.

//CY
 

Robertico

Use NTFS and MySQL (still free? don't know). Remember that a MyISAM
database filled only with text (or blob) data will need to be configured a
bit: the value of myisam_data_pointer_size can be from 2 to 7. A value of 4
allows tables up to 4 GB; a value of 6 allows tables up to 256 TB. MS Access
might work for your private CD/DVD collection, but not for handling
more data.

I'm running WAMP (Apache, PHP and MySQL) for Windows (myisam_data_pointer_size
value = 6).
But I'm not familiar with MySQL. How can I import my csv-file?

Thanks in advance,

Robertico
 

CY

I'm running WAMP (Apache, PHP and MySQL) for Windows (myisam_data_pointer_size
value = 6).
But I'm not familiar with MySQL. How can I import my csv-file?

Thanks in advance,

Robertico

Use a C# program to insert and extract the data, or load it:

LOAD DATA LOCAL INFILE '/importfile.csv'
INTO TABLE test_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(field1, field2, field3);

is another way; it is more fun to make the C# program, not so fun to
just load the data.
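
A rough C# sketch of that route, using MySQL Connector/NET (the connection
string, file path, table and column names are just examples, and loading a
local file may need to be enabled in your MySQL configuration):

using MySql.Data.MySqlClient;

class MySqlCsvLoad
{
    static void Main()
    {
        // Hypothetical connection string; fill in your own server and credentials.
        const string connectionString =
            "server=localhost;database=test;uid=root;pwd=secret;";

        // A combined unique key on both fields makes the database reject
        // duplicates; IGNORE makes the load skip them silently.
        const string createTable =
            @"CREATE TABLE IF NOT EXISTS test_table (
                  field1 VARCHAR(255) NOT NULL,
                  field2 VARCHAR(255) NOT NULL,
                  UNIQUE KEY uq_both_fields (field1, field2)
              ) ENGINE=MyISAM;";

        const string loadCsv =
            @"LOAD DATA LOCAL INFILE 'c:/importfile.csv'
              IGNORE INTO TABLE test_table
              FIELDS TERMINATED BY ','
              LINES TERMINATED BY '\n'
              (field1, field2);";

        using (var connection = new MySqlConnection(connectionString))
        {
            connection.Open();

            using (var createCommand = new MySqlCommand(createTable, connection))
            {
                createCommand.ExecuteNonQuery();
            }

            using (var loadCommand = new MySqlCommand(loadCsv, connection))
            {
                loadCommand.ExecuteNonQuery();
            }
        }
    }
}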

//CY
 

Peter Duniho

Alberto said:
[...]
Loop over the file on disk, reading the rows one by one. For each row
do the following:
- Compute a hash value on the fields that you want to be unique (the
GetHashCode method should serve you well).
- Use the hash code as the key under which you store, in a hashtable, the
offset in the file where the record was located. This hashtable will
probably be small enough to fit into your available RAM, since it only
stores a numeric value and not your whole records.
- For each record, first look its hash up in the hash table. If it exists
there, Seek to the offset in the file where the earlier record is stored,
read the data and compare it to the record you are currently processing.
If they match, then the current record is a duplicate and it can be
discarded. Otherwise, continue comparing against the rest of the records
that have the same hash code, since there could be more than one. After
you have compared them all, if none match, save the record's hash and
offset to the internal table and write the complete record to an output file.

IMHO, this suggestion is one of the best offered so far. It's
efficient, reasonably simple to implement, and offloads _all_ of the
actual key data to the file system (minus the 4- or 8-byte offset into
the file, of course). Of course, it adds the cost of a Dictionary
(preferable) or Hashtable (key and value), as compared to a HashSet (key
only), so it will work best with relatively long keys.

The sorting ones aren't bad either, but of course they get complicated
if the output needs to remain in the same order as the input (you have
to add a new sort-order field to the intermediate files, and then
re-sort the output once the duplicates have been removed; depending on
the reduction of size of input, this could mean yet again partitioning
the problem to sort segments of the output and then doing a merge sort
to reassemble everything).

I don't usually suggest using a database for random file manipulation
questions, but I think in this case that is also a good suggestion, just
due to the sheer simplicity of it. Access and SQL Express aren't the
only databases out there, and as "CY" points out, MySQL can be
configured to support larger databases, large enough to support your input.

Of course, the easiest solution would be to switch to a 64-bit OS. Then
you can allocate larger amounts of memory in your process, and the error
would go away. :) (With only 2GB of RAM on the computer, you'll
certainly run into disk swapping performance issues, but all the
fallback solutions will rely on some kind of disk i/o anyway, so
performance is going to suffer one way or the other using those techniques).

Another option you might try is to store your column values more
efficiently. Rather than using a HashSet<string>, use a HashSet<byte[]>
and store the raw bytes from the file rather than the UTF-16 of a
System.String instance (or re-encode them as ASCII or UTF-8, if that
works better for you). If you know that the keys draw only on a limited
set of characters, you could compress the bytes even further, either by
reducing the bits-per-character (treat the byte array as a bit stream), or
by using something like Huffman encoding (simple to implement and
effective on ASCII text).
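
For instance, a sketch of the byte-array variant (note that HashSet<byte[]>
needs a custom IEqualityComparer<byte[]>, because arrays otherwise compare
by reference rather than by content):

using System;
using System.Collections.Generic;
using System.Text;

// Content-based equality for byte arrays, so the set compares the bytes
// themselves rather than array references.
class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] x, byte[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null || x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
        {
            if (x[i] != y[i]) return false;
        }
        return true;
    }

    public int GetHashCode(byte[] data)
    {
        unchecked
        {
            int hash = 17;
            foreach (byte b in data)
            {
                hash = hash * 31 + b;
            }
            return hash;
        }
    }
}

class ByteKeyExample
{
    static void Main()
    {
        var seen = new HashSet<byte[]>(new ByteArrayComparer());

        // Re-encode the key as single-byte ASCII instead of keeping a UTF-16
        // string, roughly halving the memory needed per key.
        string key = "some,key"; // placeholder value
        bool added = seen.Add(Encoding.ASCII.GetBytes(key));
        Console.WriteLine(added); // True the first time, False for a duplicate.
    }
}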

In fact, depending on the data requirements for a file offset (probably
8 bytes if your file is 4GB or larger, though you could get complicated
and store fewer than 8 bytes, throwing out the most significant bytes
that you know are always 0 for the size of the file you're dealing
with), and the actual length of a key being hashed, this approach could
work as well as or better than the "hash, storing offsets to file" approach.

Finally, you might try writing your own hash data structure, and
including the ability to set in advance the exact size of the hash
table, and maybe even pre-allocate a pool of entry data structures for
the table.

Setting the size of the table in advance could be very helpful, because
at least part of your error is likely to be caused by the fact that
HashSet has to reallocate its internal data structures as it grows, and
each time that happens, you have two copies of the data in memory at
once. The reallocation also _will_ result in fragmentation of the
large-object heap, which can result in an out-of-memory error even when
there is theoretically space in the process's virtual address space to
allocate more memory.
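
As a lighter-weight version of the same idea, Dictionary does accept an
initial capacity in its constructor (HashSet did not, at least in the .NET
versions current at the time), which avoids the repeated grow-and-copy
cycles; a sketch, with a made-up row count:

using System.Collections.Generic;

class PreSizedTable
{
    static void Main()
    {
        // Hypothetical guess at the number of distinct rows; tune to your data.
        const int expectedDistinctRows = 10000000;

        // Pre-sizing allocates the internal buckets once, instead of letting
        // the dictionary double and copy its arrays as it grows (each resize
        // briefly keeps two copies alive and fragments the large-object heap).
        var offsetsByHash = new Dictionary<int, List<long>>(expectedDistinctRows);

        // ... fill and query the table as in the offset-based sketch earlier
        // in the thread ...
    }
}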

Pre-allocating a pool of entry data structures as an array from which
you pull indices rather than references to actual objects allows that to
also be created once (or at least many fewer times, if you decide, for
example, to allocate arrays of 1000 at a time), reducing overhead in the
memory manager.

This last suggestion is not necessarily as pointless as it might seem.
Depending on the nature of each row of data in your CSV file, it stands
to reason that the _actual_ memory cost of your hashing should be quite
a bit less than the 4GB the file requires, on the assumption that you're
checking only a small subset of the entire row (two columns out of how
many, of what kind of data?).

So, sure...you're definitely pushing the limits of what can be done in
the 2GB of virtual address space allotted to a 32-bit process (3GB if
you boot with /3GB and set the /LARGEADDRESSAWARE for your
application...I don't recall if this is set by default for managed
apps). But assuming the columns that are used to determine duplication
are less than, say, 1/6th or 1/8th of each row of data, then it seems
likely to me that a 4GB file could be processed even in a 32-bit
app, without resorting to using the file system as temporary storage
(other than implicit disk swapping, of course). "All" you need is a
more efficient implementation of the data structures.

If performance is very important, it could well be worth the extra
effort to approach the problem that way, rather than pushing the problem
out to the file system (which is admittedly simpler in some ways).

Note the various proposals fall into two broad categories:

-- Will still fail for slightly larger files
-- Scales well to significantly larger files

All of the memory-based solutions, while they may address your current
needs, can still be pushed to the breaking point for files even larger
than what you're dealing with. The database and file-system-only
solutions are limited only by the file system itself, and so are likely
to continue working on MUCH larger inputs, albeit at perhaps a larger
performance cost than the memory-based solutions.

In the end, it may well be that just adding more RAM and a 64-bit OS is
the best solution. :)

Pete
 

Robertico

Warm thanks to everyone, and in particular for the detailed explanations.
In this particular case, the solution turned out to be a Linux script.
Not what you would expect here, but it was efficient and it solved my
problem.
I've learned a lot about memory usage and reading large files.
For another problem, I now know what to look out for.

Regards,

Robertico
 
