big file hashing


Pohihihi

I am implementing a duplicate-file finder by computing a hash (MD5 or SHA1) of
each file. I know how to get the hash, but my problem is that the files are
big. Really big (~3-4 GB each). That is a lot of reading from the HDD, even
when matching just 20 files.

My solution is to hash only a few MB of each file (e.g. the first 10 MB). That
at least cuts down the reading and the time a lot. It does the job, but it is
not a sure-shot approach because it is not 'really' the true hash of the file,
so I am not happy with this solution.
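
Roughly, the shortcut looks like this (a minimal sketch in Python just for
illustration; the 10 MB limit and the choice of SHA1 are arbitrary):

import hashlib

def partial_hash(path, first_mb=10):
    # Hash only the first `first_mb` MB -- fast, but not the true hash of the file.
    h = hashlib.sha1()
    remaining = first_mb * 1024 * 1024
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(1024 * 1024, remaining))
            if not chunk:
                break
            h.update(chunk)
            remaining -= len(chunk)
    return h.hexdigest()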

Can anyone think of a better way?

Thank you,
 
The only way to be really sure is to do the whole hash. You could probably be
fairly sure by checking the file size, the modified date/time, and the first
(and maybe last) x MB, as you said. If the files are "yours", you could append
a hash to the file when it is created or modified; then you could just read
the hash from the start of the file as your comparer.
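
As a sketch of that staged idea (Python just for illustration; the 10 MB
prefix, SHA1, and the helper names are arbitrary choices): group files by the
cheap checks first, and only pay for a full hash on files that still collide.

import hashlib
import os
from collections import defaultdict

def quick_key(path, first_mb=10):
    # Cheap pre-filter: file size plus a hash of the first few MB.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        h.update(f.read(first_mb * 1024 * 1024))
    return (os.path.getsize(path), h.hexdigest())

def full_hash(path):
    # True hash of the whole file, streamed so a 3-4 GB file never sits in memory.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    # Stage 1: group by the cheap key; most files drop out without a full read.
    groups = defaultdict(list)
    for p in paths:
        groups[quick_key(p)].append(p)

    # Stage 2: full hash only for files whose size and prefix still collide.
    confirmed = defaultdict(list)
    for candidates in groups.values():
        if len(candidates) < 2:
            continue
        for p in candidates:
            confirmed[full_hash(p)].append(p)
    return [group for group in confirmed.values() if len(group) > 1]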
 
Thanks.


William Stacey said:
The only way to be really sure is to do the whole hash. [...]
 
You might try overlapped I/O and I/O completion ports on the large files. That
will allow multiple threads to read different sections of the file and generate
a hash for each section asynchronously. I realize each read is still hitting
the same HDD, so it depends on how much overhead the hashing itself adds, but
you could pick up some speed there.
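
A rough sketch of that shape (plain threads in Python standing in for
overlapped I/O and completion ports; the section size and worker count are
arbitrary). Note that hashing sections separately does not produce the same
value as one pass over the whole file, but the tuple of per-section digests
still works as a comparison key as long as every file is fingerprinted the
same way:

import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

SECTION = 256 * 1024 * 1024  # 256 MB per section -- an arbitrary choice

def hash_section(path, offset, length):
    # Each worker opens its own handle, seeks to its section, and hashes it.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = f.read(min(1024 * 1024, remaining))
            if not chunk:
                break
            h.update(chunk)
            remaining -= len(chunk)
    return h.digest()

def section_fingerprint(path, workers=4):
    # Tuple of per-section digests, computed concurrently.
    size = os.path.getsize(path)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = [pool.submit(hash_section, path, off, min(SECTION, size - off))
                for off in range(0, size, SECTION)]
        return tuple(job.result() for job in jobs)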

jim
 
