big file hashing


Pohihihi

I am implementing a duplicate-file finder by computing a hash (MD5 or SHA1) of
each file. I know how to get the hash, but my problem is that the files are
big. Really big (~3-4 GB each). That is a lot of reading from the HDD, even
when matching just 20 files.

My solution is to hash only a few MB of each file (e.g. the first 10 MB). That
at least cuts down the reading and the time a lot. It does the job, but it is
not a sure-shot approach because it is not 'really' the true hash of the file,
so I am not happy with this solution.
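
Roughly, the shortcut looks like this (a minimal sketch in Python just for
illustration; the 10 MB limit and the choice of SHA1 are arbitrary):

import hashlib

def partial_hash(path, first_mb=10):
    # Hash only the first `first_mb` MB -- fast, but not the true hash of the file.
    h = hashlib.sha1()
    remaining = first_mb * 1024 * 1024
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(1024 * 1024, remaining))
            if not chunk:
                break
            h.update(chunk)
            remaining -= len(chunk)
    return h.hexdigest()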

Can anyone think of a better way?

Thank you,
 
The only way to be really sure is to do the whole hash. You could probably be
fairly sure by checking the file size, the modified date/time, and the first
(and maybe last) x MB, as you said. If the files are "yours", you could append
a hash to the file when it is created or modified; then you could just read
the hash from the start of the file as your comparer.
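
As a sketch of that staged idea (Python just for illustration; the 10 MB
prefix, SHA1, and the helper names are arbitrary choices): group files by the
cheap checks first, and only pay for a full hash on files that still collide.

import hashlib
import os
from collections import defaultdict

def quick_key(path, first_mb=10):
    # Cheap pre-filter: file size plus a hash of the first few MB.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        h.update(f.read(first_mb * 1024 * 1024))
    return (os.path.getsize(path), h.hexdigest())

def full_hash(path):
    # True hash of the whole file, streamed so a 3-4 GB file never sits in memory.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    # Stage 1: group by the cheap key; most files drop out without a full read.
    groups = defaultdict(list)
    for p in paths:
        groups[quick_key(p)].append(p)

    # Stage 2: full hash only for files whose size and prefix still collide.
    confirmed = defaultdict(list)
    for candidates in groups.values():
        if len(candidates) < 2:
            continue
        for p in candidates:
            confirmed[full_hash(p)].append(p)
    return [group for group in confirmed.values() if len(group) > 1]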
 
Thanks.


William Stacey said:
The only way to be really sure is to do the whole hash. [...]
 
You might try overlapped I/O and I/O completion ports on the large files. That
will allow multiple threads to read different sections of the file and generate
a hash for each section asynchronously. I realize each read is still hitting
the same HDD, so it depends on how much overhead the hashing itself adds, but
you could pick up some speed there.
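
A rough sketch of that shape (plain threads in Python standing in for
overlapped I/O and completion ports; the section size and worker count are
arbitrary). Note that hashing sections separately does not produce the same
value as one pass over the whole file, but the tuple of per-section digests
still works as a comparison key as long as every file is fingerprinted the
same way:

import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

SECTION = 256 * 1024 * 1024  # 256 MB per section -- an arbitrary choice

def hash_section(path, offset, length):
    # Each worker opens its own handle, seeks to its section, and hashes it.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = f.read(min(1024 * 1024, remaining))
            if not chunk:
                break
            h.update(chunk)
            remaining -= len(chunk)
    return h.digest()

def section_fingerprint(path, workers=4):
    # Tuple of per-section digests, computed concurrently.
    size = os.path.getsize(path)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = [pool.submit(hash_section, path, off, min(SECTION, size - off))
                for off in range(0, size, SECTION)]
        return tuple(job.result() for job in jobs)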

jim
 
