Best Performance File Compare: MD5/SHA1 or Byte-by-Byte Checking?

 
 
Mahmoud Al-Qudsi
      4th Apr 2007
I'm looking to compare the contents of two files. Files will generally
not exceed 1024 *bytes* in length.
Given this info, and assuming that the accuracy/reliability of SHA1 is
more than enough, is it more efficient to

a) Use System.Security.Cryptography and get the SHA1 of each binary
file and compare the two hashes
b) Create a byte-by-byte checker that loops through the two files and
exits with a false when a byte doesn't match in the same location
between the two files?

Generally speaking, I'd use the second method when dealing with
anything larger than 512 KB, expecting it to take fewer resources and less time.

However, in the case of such small files, is SHA1 a better-performing
alternative? What about MD5?
Assuming 99% of the time the two files will match, is MD5's limited
reliability enough to determine whether the two files are a match? Is
the performance difference between MD5 and SHA1 worth going with MD5
or am I better off sticking with the latter?

I'm guessing MD5 is good enough, that SHA1 takes a lot longer, and
that it won't matter since byte-by-byte is more efficient and faster
code (assuming it's programmed half-decently of course)... But I'd
like to make sure since I'm looking for a minimal hit on system
resources.

Thanks!

 
Nicholas Paldino [.NET/C# MVP]
      4th Apr 2007
Mahmoud,

I hate to say this, but in the time it probably took to write this post,
you could have easily generated numbers which show the performance profiles
for your particular case.

I would look at the Stopwatch class, and then start testing to see how
long it would take to perform each operation. The operations themselves, as
well as the code to perform the timing, aren't difficult at all.
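A minimal sketch of such a timing harness, assuming .NET 2.0 or later for `System.Diagnostics.Stopwatch`. The `Timing` class and `Measure` method are names made up for illustration; the action passed in would be whichever comparison is under test.

```csharp
using System;
using System.Diagnostics;

static class Timing
{
    // Runs `action` once as a warm-up (so JIT compilation is not measured),
    // then `iterations` more times, and returns the total elapsed time
    // for the measured runs only.
    public static TimeSpan Measure(Action action, int iterations)
    {
        action(); // warm-up call, not timed

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();
        return sw.Elapsed;
    }
}
```

For operations as short as comparing two 1 KB files, a single run is usually too brief to time reliably, so a large iteration count and a comparison of totals is likely to give steadier numbers. Repeated runs will mostly hit the OS file cache, which is worth keeping in mind when reading the results.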

If I had to guess, for files that are 1024 bytes, it is probably easier
to just loop through them to see if any of the bytes differ. It would
probably be much faster than hashing the whole thing (since the hash has to
cycle through all of the bytes anyway, whereas the loop can exit as soon as
it finds a difference between any two of them).

Even in the 512 KB case, you might want to use the method that loops
through two streams. This is an important point. Make sure you do not load
the entire contents of the two files into memory. For the small files, it's
no big deal, but for large files, you are going to take a hit trying to load
all of that into memory. By reading chunks of the files into memory, and then
comparing the chunks, you are going to make the process much more efficient.
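One way the chunked comparison could look, as a rough sketch. The `StreamCompare` class, the `ReadFully` helper, and the 4096-byte default buffer are assumptions for illustration, not something from the thread. `Stream.Read` may return fewer bytes than requested, so the helper keeps reading until the buffer is full or the stream ends.

```csharp
using System;
using System.IO;

static class StreamCompare
{
    // Fills `buffer` as far as possible. The return value is smaller than
    // buffer.Length only when the end of the stream has been reached.
    static int ReadFully(Stream s, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = s.Read(buffer, total, buffer.Length - total);
            if (n == 0) break; // end of stream
            total += n;
        }
        return total;
    }

    // Compares two files chunk by chunk without loading either one whole.
    public static bool FilesEqual(string pathA, string pathB, int bufferSize = 4096)
    {
        using (FileStream a = File.OpenRead(pathA))
        using (FileStream b = File.OpenRead(pathB))
        {
            byte[] bufA = new byte[bufferSize];
            byte[] bufB = new byte[bufferSize];
            while (true)
            {
                int readA = ReadFully(a, bufA);
                int readB = ReadFully(b, bufB);
                if (readA != readB) return false; // one file ended first
                if (readA == 0) return true;      // both ended, no mismatch seen
                for (int i = 0; i < readA; i++)
                {
                    if (bufA[i] != bufB[i]) return false; // exit on first difference
                }
            }
        }
    }
}
```

Memory use stays at two buffers regardless of file size, and the method returns at the first differing byte.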

Hope this helps.

--
- Nicholas Paldino [.NET/C# MVP]



 
 
Carl Daniel [VC++ MVP]
      4th Apr 2007
Mahmoud Al-Qudsi wrote:
> I'm looking to compare the contents of two files. Files will generally
> not exceed 1024 *bytes* in length.
> Given this info, and assuming that the accuracy/reliability of SHA1 is
> more than enough, is it more efficient to


Both MD5 and SHA1 are computationally expensive hashes. Calculating the hash of a file,
regardless of size, will take several times longer than simply comparing the
two files. Hashes are best used for integrity verification when you only
have one copy of a file at a given location and want to ensure that it's the
same as another copy at another location (or another time, or both).
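For the integrity-verification case described above, where only one copy is at hand plus a digest recorded elsewhere, a sketch might look like the following. The `IntegrityCheck` class and its method names are hypothetical; `SHA1.Create` and `ComputeHash(Stream)` are the standard `System.Security.Cryptography` calls.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class IntegrityCheck
{
    // Computes the SHA-1 digest of a file as a lowercase hex string.
    public static string Sha1Hex(string path)
    {
        using (SHA1 sha1 = SHA1.Create())
        using (FileStream fs = File.OpenRead(path))
        {
            byte[] hash = sha1.ComputeHash(fs);
            // BitConverter yields "A9-99-3E-..."; strip the dashes.
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    // True if the file's digest matches one recorded earlier or elsewhere.
    public static bool MatchesDigest(string path, string expectedHex)
    {
        return string.Equals(Sha1Hex(path), expectedHex,
                             StringComparison.OrdinalIgnoreCase);
    }
}
```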

If you have both files, just compare them, regardless of how big they are.

-cd


 
 
Mahmoud Al-Qudsi
      4th Apr 2007

Thanks for the info Nicholas,
I'm looking into the stopwatch class even now, and I'm feeling pretty
stupid at how easy it is to use ;-)

Of course I am using a buffer for my byte-by-byte comparer, loading
64 bytes at a time, which, though not efficient, is the best I can do
with such tiny files.

 
 
Samuel R. Neff
      4th Apr 2007

SHA1 and MD5 will both require looping through the whole file just to
generate the hash, so with either hash you're looping through both
files completely before doing any comparison.

With a byte-by-byte compare you can short-circuit your loop as soon as
there's a mismatch. Also, you don't even have to start looping if the
byte count is not exactly the same. Given these two optimizations
clearly a byte-by-byte comparison would be faster.
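The two optimizations can be sketched as below, assuming files of about 1 KB so that reading each one whole is acceptable. `QuickCompare` is a made-up name for illustration.

```csharp
using System.IO;

static class QuickCompare
{
    public static bool FilesEqual(string pathA, string pathB)
    {
        // Optimization 1: different sizes mean different contents,
        // so no file data needs to be read at all.
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
        {
            return false;
        }

        // Assumption: the files are tiny, so ReadAllBytes is fine here.
        byte[] a = File.ReadAllBytes(pathA);
        byte[] b = File.ReadAllBytes(pathB);

        // Guard against a file changing between the size check and the read.
        if (a.Length != b.Length) return false;

        // Optimization 2: stop at the first mismatching byte.
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i]) return false;
        }
        return true;
    }
}
```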

Now if you're talking about files that change rarely and you can cache
the hash value, then the answer could be different, and I might
suggest using CRC32 instead of either of the above. CRC32 isn't built
into .NET but there's a decent class on codeproject.com for it.
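A rough sketch of the caching idea. MD5 stands in for CRC32 here purely because it ships with the framework; that substitution, the `HashCache` class, and the choice to invalidate on last-write time plus length are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

class HashCache
{
    struct Entry
    {
        public DateTime Stamp; // last write time when the hash was taken
        public long Length;    // file length when the hash was taken
        public string Hash;
    }

    private readonly Dictionary<string, Entry> _entries =
        new Dictionary<string, Entry>();

    // Number of times a file was actually read and hashed.
    public int Computations { get; private set; }

    public string GetHash(string path)
    {
        FileInfo info = new FileInfo(path);
        Entry e;
        // Reuse the cached hash only if the file looks unchanged.
        if (_entries.TryGetValue(info.FullName, out e)
            && e.Stamp == info.LastWriteTimeUtc
            && e.Length == info.Length)
        {
            return e.Hash;
        }

        using (MD5 md5 = MD5.Create())
        using (FileStream fs = File.OpenRead(path))
        {
            e.Hash = Convert.ToBase64String(md5.ComputeHash(fs));
        }
        e.Stamp = info.LastWriteTimeUtc;
        e.Length = info.Length;
        _entries[info.FullName] = e;
        Computations++;
        return e.Hash;
    }
}
```

Timestamp-based invalidation can miss an edit that keeps the same length and lands within the file system's timestamp resolution, so this is a trade-off rather than a guarantee.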

HTH,

Sam



 
 
Jon Skeet [C# MVP]
      4th Apr 2007
On Apr 4, 4:16 pm, "Mahmoud Al-Qudsi" <mqu...@gmail.com> wrote:
> I'm looking to compare the contents of two files. Files will generally
> not exceed 1024 *bytes* in length.
> Given this info, and assuming that the accuracy/reliability of SHA1 is
> more than enough, is it more efficient to
>
> a) Use System.Security.Cryptography and get the SHA1 of each binary
> file and compare the two hashes
> b) Create a byte-by-byte checker that loops through the two files and
> exits with a false when a byte doesn't match in the same location
> between the two files?


Unless you need to store the hash for a quick comparison later on, I
can't see any benefit in using a hash. It's going to have to go
through every byte of the file, and it's bound to have to do more work
than a simple comparison.

Things may change if you want to compare *several* files, or check
whether a file has changed compared with an earlier version.
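As an illustration of the several-files case, a sketch that reads each file once and groups paths by digest, instead of comparing every pair. `DuplicateFinder` and `GroupByHash` are hypothetical names; a matching digest marks files as candidates, and a byte-by-byte check can confirm them if collisions are a concern.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class DuplicateFinder
{
    // Returns groups of two or more paths whose SHA-1 digests match.
    public static List<List<string>> GroupByHash(IEnumerable<string> paths)
    {
        Dictionary<string, List<string>> byHash =
            new Dictionary<string, List<string>>();

        using (SHA1 sha1 = SHA1.Create())
        {
            foreach (string path in paths)
            {
                string key;
                using (FileStream fs = File.OpenRead(path))
                {
                    key = Convert.ToBase64String(sha1.ComputeHash(fs));
                }

                List<string> group;
                if (!byHash.TryGetValue(key, out group))
                {
                    group = new List<string>();
                    byHash[key] = group;
                }
                group.Add(path);
            }
        }

        // Keep only digests shared by more than one file.
        List<List<string>> result = new List<List<string>>();
        foreach (List<string> g in byHash.Values)
        {
            if (g.Count > 1) result.Add(g);
        }
        return result;
    }
}
```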

Jon

 
 
Nicholas Paldino [.NET/C# MVP]
      4th Apr 2007
Mahmoud,

Why not load from the file in 1024-byte chunks? This way, you have at
most one chunk to load per file. Certainly, two blocks of 1024 bytes aren't
going to kill your app (unless you have a horrendously ancient machine).


--
- Nicholas Paldino [.NET/C# MVP]



 