Best Performance File Compare: MD5/SHA1 or Byte-by-Byte Checking?

  • Thread starter Mahmoud Al-Qudsi

Mahmoud Al-Qudsi

I'm looking to compare the contents of two files. Files will generally
not exceed 1024 *bytes* in length.
Given this info, and assuming that the accuracy/reliability of SHA1 is
more than enough, is it more efficient to

a) Use System.Security.Cryptography and get the SHA1 of each binary
file and compare the two hashes
b) Create a byte-by-byte checker that loops through the two files and
exits with a false when a byte doesn't match in the same location
between the two files?

Generally speaking, I'd use the second method when dealing with
anything larger than 512 KB, expecting it to take fewer resources and less time.

However, in the case of such small files, is SHA1 a better-performing
alternative? What about MD5?
Assuming 99% of the time the two files will match, is MD5's limited
reliability enough to determine whether the two files are a match? Is
the performance difference between MD5 and SHA1 worth going with MD5
or am I better off sticking with the latter?

I'm guessing MD5 is good enough, that SHA1 takes a lot longer, and
that it won't matter since byte-by-byte is more efficient and faster
code (assuming it's programmed half-decently of course)... But I'd
like to make sure since I'm looking for a minimal hit on system
resources.

Thanks!
 

Nicholas Paldino [.NET/C# MVP]

Mahmoud,

I hate to say this, but in the time it probably took to write this post,
you could have easily generated numbers which show the performance profiles
for your particular case.

I would look at the Stopwatch class, and then start testing to see how
long it would take to perform each operation. The operations themselves, as
well as the code to perform the timing, aren't difficult at all.
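As a rough illustration of the kind of test described above, here is a minimal sketch that times a byte-by-byte loop against SHA1 and MD5 with Stopwatch. The class and method names (CompareTiming, BytesEqual, HashesEqual, Time, Run) are made up for this example, and two in-memory 1024-byte buffers stand in for the files, so disk I/O is deliberately left out of the measurement. The buffers are identical, which is the worst case for the loop because it can never exit early.

```csharp
using System;
using System.Diagnostics;
using System.Security.Cryptography;

static class CompareTiming
{
    // Byte-by-byte comparison that exits at the first mismatch.
    public static bool BytesEqual(byte[] a, byte[] b)
    {
        if (a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    // Hash-based comparison: hash both buffers, then compare the digests.
    public static bool HashesEqual(HashAlgorithm algorithm, byte[] a, byte[] b)
    {
        return BytesEqual(algorithm.ComputeHash(a), algorithm.ComputeHash(b));
    }

    // Runs the action the given number of times and returns the elapsed time.
    public static TimeSpan Time(int iterations, Action action)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) action();
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Run()
    {
        // Assumption: identical random 1024-byte buffers stand in for two matching files.
        byte[] first = new byte[1024];
        new Random(42).NextBytes(first);
        byte[] second = (byte[])first.Clone();
        const int iterations = 100000;

        using (SHA1 sha1 = SHA1.Create())
        using (MD5 md5 = MD5.Create())
        {
            Console.WriteLine("Byte-by-byte: " + Time(iterations, () => BytesEqual(first, second)));
            Console.WriteLine("SHA1:         " + Time(iterations, () => HashesEqual(sha1, first, second)));
            Console.WriteLine("MD5:          " + Time(iterations, () => HashesEqual(md5, first, second)));
        }
    }
}
```

Calling CompareTiming.Run() prints one elapsed time per method. The numbers depend on the machine and runtime, so treat them as a comparison between the methods on that machine and nothing more.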

If I had to guess, for files that are 1024 bytes, it is probably easier
to just loop through them to see if any of the bytes differ. It would
probably be much faster than hashing the whole thing, since the hash has to
cycle through all of the bytes anyway, while the loop can stop as soon as it
finds a difference between any two of them.

Even in the 512 KB case, you might want to use the method that loops
through two streams. One important point: make sure you do not load
the entire contents of the two files into memory. For the small files, it's
no big deal, but for large files, you are going to take a hit trying to load
all of that into memory. By reading chunks of the files into memory, and then
comparing the chunks, you are going to make the process much more efficient.
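A minimal sketch of the chunked approach described above, assuming two readable streams and a caller-chosen chunk size. StreamComparer, ReadFully and StreamsEqual are names invented for this example. The helper keeps reading until the buffer is full because Stream.Read is allowed to return fewer bytes than requested.

```csharp
using System;
using System.IO;

static class StreamComparer
{
    // Fills the buffer from the stream until it is full or the stream ends.
    // Returns the number of bytes actually read.
    static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) break; // end of stream
            total += read;
        }
        return total;
    }

    // Compares two streams chunk by chunk, holding at most two chunks in memory.
    public static bool StreamsEqual(Stream a, Stream b, int chunkSize)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException("chunkSize");

        byte[] bufferA = new byte[chunkSize];
        byte[] bufferB = new byte[chunkSize];
        while (true)
        {
            int readA = ReadFully(a, bufferA);
            int readB = ReadFully(b, bufferB);
            if (readA != readB) return false; // one stream ended before the other
            if (readA == 0) return true;      // both ended with no mismatch found

            for (int i = 0; i < readA; i++)
            {
                if (bufferA[i] != bufferB[i]) return false; // stop at the first difference
            }
        }
    }
}
```

For files, the streams could come from File.OpenRead on each path, with a chunk size of a few kilobytes as a starting point to tune by measurement.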

Hope this helps.
 

Carl Daniel [VC++ MVP]

Mahmoud said:
I'm looking to compare the contents of two files. Files will generally
not exceed 1024 *bytes* in length.
Given this info, and assuming that the accuracy/reliability of SHA1 is
more than enough, is it more efficient to

Both MD5 and SHA1 are very complex hashes. Calculating the hash of a file,
regardless of size, will take several times longer than simply comparing the
two files. Hashes are best used for integrity verification when you only
have one copy of a file at a given location and want to ensure that it's the
same as another copy at another location (or another time, or both).

If you have both files, just compare them, regardless of how big they are.

-cd
 

Mahmoud Al-Qudsi

Thanks for the info Nicholas,
I'm looking into the stopwatch class even now, and I'm feeling pretty
stupid at how easy it is to use ;-)

Of course I am using a buffer for my byte-by-byte comparer, loading
64 bytes at a time, which, though not efficient, is the best I can do
with such tiny files.
 

Samuel R. Neff

SHA1 and MD5 will both require looping through the whole file just to
generate the hash so in either method you're looping through both
files completely before doing any comparison.

With a byte-by-byte compare you can short-circuit your loop as soon as
there's a mismatch. Also, you don't even have to start looping if the
byte count is not exactly the same. Given these two optimizations,
a byte-by-byte comparison would clearly be faster.
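The two optimizations above can be sketched roughly as follows, assuming both paths point at existing files. FileComparer and FilesEqual are illustrative names, not an existing API. The length check uses FileInfo.Length, so files of different sizes are rejected without reading any content.

```csharp
using System.IO;

static class FileComparer
{
    public static bool FilesEqual(string pathA, string pathB)
    {
        // Optimization 1: different sizes mean different content, so no bytes need to be read.
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length) return false;

        using (FileStream a = File.OpenRead(pathA))
        using (FileStream b = File.OpenRead(pathB))
        {
            int byteA;
            do
            {
                // ReadByte returns -1 at end of stream; FileStream buffers reads internally.
                byteA = a.ReadByte();
                // Optimization 2: stop at the first mismatch.
                if (byteA != b.ReadByte()) return false;
            } while (byteA != -1);

            return true; // both streams ended without a difference
        }
    }
}
```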

Now if you're talking about files that change rarely and you can cache
the hash value, then the answer could be different, and I might
suggest using CRC32 instead of either of the above. CRC32 isn't built
into .NET but there's a decent class on codeproject.com for it.

HTH,

Sam
 

Jon Skeet [C# MVP]

Mahmoud said:
I'm looking to compare the contents of two files. Files will generally
not exceed 1024 *bytes* in length.
Given this info, and assuming that the accuracy/reliability of SHA1 is
more than enough, is it more efficient to

a) Use System.Security.Cryptography and get the SHA1 of each binary
file and compare the two hashes
b) Create a byte-by-byte checker that loops through the two files and
exits with a false when a byte doesn't match in the same location
between the two files?

Unless you need to store the hash for a quick comparison later on, I
can't see any benefit in using a hash. It's going to have to go
through every byte of the file, and it's bound to have to do more work
than a simple comparison.

Things may change if you want to compare *several* files, or check
whether a file has changed compared with an earlier version.
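To illustrate the case where a stored hash does pay off, here is a small sketch that remembers the last SHA1 of each path and reports whether the file has changed since the previous check. HashCache, ComputeHash and HasChanged are hypothetical names for this example, and the in-memory dictionary is an assumption: a real tool might persist the hashes somewhere instead.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

class HashCache
{
    // Maps a file path to the Base64-encoded SHA1 seen at the last check.
    private readonly Dictionary<string, string> _hashes = new Dictionary<string, string>();

    // Hashes the file by streaming it, so the whole file is never held in memory.
    public static string ComputeHash(string path)
    {
        using (SHA1 sha1 = SHA1.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            return Convert.ToBase64String(sha1.ComputeHash(stream));
        }
    }

    // Returns true if no hash is stored for the path or the stored hash differs
    // from the current one, then stores the current hash for the next check.
    public bool HasChanged(string path)
    {
        string current = ComputeHash(path);
        string previous;
        bool changed = !_hashes.TryGetValue(path, out previous) || previous != current;
        _hashes[path] = current;
        return changed;
    }
}
```

Only one file is read per check here, which is where the hash earns its cost: the earlier version no longer needs to be available for a direct comparison.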

Jon
 

Nicholas Paldino [.NET/C# MVP]

Mahmoud,

Why not load from the file in 1024-byte chunks? This way, you have at
most one chunk to load per file. Certainly, two blocks of 1024 bytes aren't
going to kill your app (unless you have a horrendously ancient machine).


 
