File Hashing

Johnny Jörgensen · May 28, 2008

I'm wondering (and hoping that somebody will be able to answer this):

If I calculate the hash value of files (either MD5 or SHA1), can I then be
sure that:

1) Two files with the same hash value are in fact identical?

2) Two different files will NEVER have the same hash value?

3) If two files have the same MD5 hash value, they will ALSO have the same
SHA1 hash value (I should think that will always be the case)?

TIA,
Johnny J.

Barry Kelly · May 28, 2008

Johnny said:
I'm wondering (and hoping that somebody will be able to answer this):

If I calculate the hash value of files (either MD5 or SHA1), can I then be
sure that:

1) Two files with the same hash value are in fact identical?

No, but fairly sure.

2) Two different files will NEVER have the same hash value?

No, but fairly sure.

3) If two files have the same MD5 hash value, they will ALSO have the same
SHA1 hash value (I should think that will always be the case)?

No, but fairly sure.

-- Barry

sylvain.rodrigue · May 28, 2008

Johnny Jörgensen said:
I'm wondering (and hoping that somebody will be able to answer this):

If I calculate the hash value of files (either MD5 or SHA1), can I then be
sure that:

1) Two files with the same hash value are in fact identical?

2) Two different files will NEVER have the same hash value?

3) If two files have the same MD5 hash value, they will ALSO have the same
SHA1 hash value (I should think that will always be the case)?

TIA,
Johnny J.

Hello,

All hashing functions have a finite set of return values (say, 2^128)
but an infinite number of possible input values. This clearly implies
that two input values CAN generate the same output value.

But in practice, the probability that you will find two input values
generating the same hash signature is pretty close to zero. I would
say:

1) Yes. It will be the same file (well, most of the time; read this:
http://www.mathstat.dal.ca/~selinger/md5collision/)
2) Yes.
3) No. Using both an MD5 and a SHA-1 will in fact reduce the number of
possible collisions.
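
As a rough illustration of point 3, the sketch below computes both digests and treats the pair as one fingerprint, so two files only "match" when both the MD5 and the SHA-1 values agree. The helper name `combined_fingerprint` is made up for this example, and Python's standard `hashlib` stands in for whatever hashing API is actually in use.

```python
import hashlib
from typing import Tuple

def combined_fingerprint(data: bytes) -> Tuple[str, str]:
    """Return (MD5 hex digest, SHA-1 hex digest) for the given bytes.

    Hypothetical helper: a match now requires both digests to agree.
    """
    return hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest()

first = combined_fingerprint(b"first file contents")
second = combined_fingerprint(b"second file contents")
same = first == second  # True only if BOTH digests are equal
```

The pair is 128 + 160 bits long, which is why it can distinguish more inputs than either digest alone.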

Jon Skeet [C# MVP] · May 28, 2008

Johnny Jörgensen said:
I'm wondering (and hoping that somebody will be able to answer this):

If I calculate the hash value of files (either MD5 or SHA1), can I then be
sure that:

1) Two files with the same hash value are in fact identical?

No. Think how much data is contained in a hash. Suppose you have a 128
bit hash. Now think about just files which are (say) 136 bits in
length. How many possible files of that length are there? Now how many
possible 128 bit hash values are there?

A slightly different way of looking at this: suppose you see some
people, and label each one with a different (capital) letter of the
alphabet to tell them apart. When you've got more than 26 people,
you're *bound* to have at least two people who have the same letter.

2) Two different files will NEVER have the same hash value?

That's the same question as question 1.

3) If two files have the same MD5 hash value, they will ALSO have the same
SHA1 hash value (I should think that will always be the case)?

No, not necessarily. It's incredibly likely - hashes are designed such
that you'd be extremely unlucky to run into two files with the same
hash but different content. It's possible though.
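
The counting argument above can be tried out on a deliberately shrunken hash. The sketch below is an illustration only: `tiny_hash` is a made-up helper that keeps just the first byte of an MD5 digest, so there are only 256 possible values and any 257 distinct inputs must contain a collision.

```python
import hashlib

def tiny_hash(data: bytes) -> int:
    """Hypothetical 8-bit hash: the first byte of the MD5 digest."""
    return hashlib.md5(data).digest()[0]

def find_collision(count: int = 257):
    """Hash `count` distinct inputs and return the first colliding pair."""
    seen = {}  # maps tiny hash value -> the input that produced it
    for i in range(count):
        data = str(i).encode("ascii")
        h = tiny_hash(data)
        if h in seen:
            return seen[h], data, h
        seen[h] = data
    return None  # cannot happen once count exceeds the 256 possible values

collision = find_collision()
```

A real 128-bit or 160-bit hash behaves the same way in principle; there are simply far too many possible values for this brute-force search to be practical.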

KH · May 28, 2008

You could use the file length as an additional piece of "metadata" -- if two
files were to have the same hash but different byte lengths then they are not
the same. That's probably going to solve most hash collisions. If you do
find a case of two files having the same hash and length, then you need to do
a byte-for-byte comparison to determine equality.
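
A minimal sketch of that sequence of checks, assuming Python's standard library rather than the .NET classes the thread is about. The function names `hash_file` and `files_identical` and the default algorithm are choices made for this example only.

```python
import filecmp
import hashlib
import os

def hash_file(path: str, algorithm: str = "sha1", chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def files_identical(path_a: str, path_b: str) -> bool:
    """Cheap checks first; a byte-for-byte comparison has the final say."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # different lengths: cannot be the same content
    if hash_file(path_a) != hash_file(path_b):
        return False  # different hashes: cannot be the same content
    # Same length and same hash: confirm with a full comparison.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

When only two files are being compared, the final byte-for-byte step alone would do; the hash pays off when stored hashes of many files are compared against each other.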

HTH

Arne Vajhøj · May 28, 2008

Johnny said:
If I calculate the hash value of files (either MD5 or SHA1), can I then be
sure that:

1) Two files with the same hash value are in fact identical?

2) Two different files will NEVER have the same hash value?

Others have already answered that question.

But there is an important point that should be
emphasized:

* if you want to protect against accidentally matching
files, then you should not worry, the probabilities
of 1/2^128 and 1/2^160 are close to impossible, so
both MD5 and SHA1 are fine

* if you want to protect against malicious matching
files then it is a completely different game - MD5 is
completely broken and SHA1 is somewhat broken - neither
is usable and you should go for SHA256 instead
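
For reference, the digest sizes behind those probabilities can be checked directly. This is only a sketch using Python's `hashlib` as a stand-in; the helper name `digest_bits` is invented for the example.

```python
import hashlib

def digest_bits(algorithm: str) -> int:
    """Size of the digest produced by `algorithm`, in bits."""
    return hashlib.new(algorithm).digest_size * 8

# Digest sizes for the three algorithms mentioned in this post.
sizes = {name: digest_bits(name) for name in ("md5", "sha1", "sha256")}

# Switching to SHA-256 is a change of algorithm name, not of approach.
sha256_hex = hashlib.sha256(b"abc").hexdigest()
```

Here `sizes` maps MD5 to 128 bits, SHA-1 to 160 bits and SHA-256 to 256 bits.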

Arne

Arne Vajhøj · May 28, 2008

Peter said:
Granted, I'm not a crypto expert. However, I'd say the answer to this
is also "no". If MD5 provided just as much differentiating power as
SHA1, even though it's 128 bits while SHA1 is 160 bits, then why would
anyone bother with SHA1? No, I think it's safe to say that there are at
least some pairs of files for which the MD5 hash is identical, but the
SHA1 hash is not.

I would rephrase it as: if identical MD5 hash implied identical SHA1
hash, then SHA1 could only return 2^128 different values.

Of course, finding two different files that produce the exact same hash
in either algorithm is either contrived or very difficult.

Serious weaknesses in both have been found.

Arne

Cor Ligthert[MVP] · May 29, 2008

Johnny,

No, you can only be sure that there is a low chance that somebody could
recreate your files by guessing what their content would be.

In my view, checking whether something is complete has nothing to do with
security encryption.

Cor

Johnny Jörgensen · May 29, 2008

Good idea - Thanks

/Johnny

KH said:
You could use the file length as an additional piece of "metadata" -- if
two
files were to have the same hash but different byte lengths then they are
not
the same. That's probably going to solve most hash collisions. If you do
find a case of two files having the same hash and length, then you need to
do
a byte-for-byte comparison to determine equality.

HTH

Johnny Jörgensen · May 29, 2008

Thanks

/Johnny J.

Barry Kelly said:
No, but fairly sure.

No, but fairly sure.

No, but fairly sure.

-- Barry

Johnny Jörgensen · May 29, 2008

Hi Peter

First of all - thanks for your reply to my question. Much appreciated.

As for your comments on my posting technique:

1) The question is NOT crossposted. It would have been if I posted TWO
separate messages to which people responded individually. But I have posted
ONE message to two different groups and replies from one group will show up
in the other.

2) Are you a programmer at all? How can you reason that a post that's
relevant in a C# group cannot possibly be relevant in a VB.NET group? The
only difference (ok maybe not the only, but the most important difference)
is different syntax. If somebody has a general question about the
functionality of a .NET class then syntax doesn't matter, and a VB.NET
programmer can just as well tell you the correct answer as a C# programmer.

3) To everybody who answered that my first two questions were identical:
They're not - it depends on the answer.

Thanks,
Johnny J.

Johnny Jörgensen · May 29, 2008

Thanks

Johnny J.

sylvain.rodrigue said:

I'm wondering (and hoping that somebody will be able to answer this):

If I calculate the hash value of files (either MD5 or SHA1), can I then be
sure that:

1) Two files with the same hash value are in fact identical?

2) Two different files will NEVER have the same hash value?

3) If two files have the same MD5 hash value, they will ALSO have the same
SHA1 hash value (I should think that will always be the case)?

TIA,
Johnny J.

Hello,

All hashing functions have a finite set of return values (say, 2^128)
but an infinite number of possible input values. This clearly implies
that two input values CAN generate the same output value.

But in practice, the probability that you will find two input values
generating the same hash signature is pretty close to zero. I would
say:

1) Yes. It will be the same file (well, most of the time; read this:
http://www.mathstat.dal.ca/~selinger/md5collision/)
2) Yes.
3) No. Using both an MD5 and a SHA-1 will in fact reduce the number of
possible collisions.

Johnny Jörgensen · May 29, 2008

Thanks

Johnny J.

Jon Skeet [C# MVP] said:

Johnny Jörgensen said:
I'm wondering (and hoping that somebody will be able to answer this):

If I calculate the hash value of files (either MD5 or SHA1), can I then be
sure that:

1) Two files with the same hash value are in fact identical?

No. Think how much data is contained in a hash. Suppose you have a 128
bit hash. Now think about just files which are (say) 136 bits in
length. How many possible files of that length are there? Now how many
possible 128 bit hash values are there?

A slightly different way of looking at this: suppose you see some
people, and label each one with a different (capital) letter of the
alphabet to tell them apart. When you've got more than 26 people,
you're *bound* to have at least two people who have the same letter.

2) Two different files will NEVER have the same hash value?

That's the same question as question 1.

3) If two files have the same MD5 hash value, they will ALSO have the same
SHA1 hash value (I should think that will always be the case)?

No, not necessarily. It's incredibly likely - hashes are designed such
that you'd be extremely unlucky to run into two files with the same
hash but different content. It's possible though.

Johnny Jörgensen · May 29, 2008

Thanks

Johnny J.

Arne Vajhøj said:
Others have already answered that question.

But there is an important point that should be
emphasized:

* if you want to protect against accidentally matching
files, then you should not worry, the probabilities
of 1/2^128 and 1/2^160 are close to impossible, so
both MD5 and SHA1 are fine

* if you want to protect against malicious matching
files then it is a completely different game - MD5 is
completely broken and SHA1 is somewhat broken - neither
is usable and you should go for SHA256 instead

Arne

Johnny Jörgensen · May 29, 2008

That wasn't the intention behind my question either. Simply to determine if
two files are identical or not.

/Johnny J.

Barry Kelly · May 29, 2008

KH said:
You could use the file length as an additional piece of "metadata" -- if two
files were to have the same hash but different byte lengths then they are not
the same. That's probably going to solve most hash collisions.

File length (specifically, bit count) is part of the MD5 and SHA1 hash
calculations. There is less information per bit of a separate length
indicator than you're getting out of the bits in the MD5 or SHA1 hashes.
If minimizing collisions is the priority, then using a better hash
function, like SHA-224, SHA-256, etc. will give better "bang for buck"
in terms of bit information. Given that you could probably expect file
length to require a 64-bit number, choosing SHA-224 over the 160-bit SHA-1
seems to be obvious.

Because of the birthday paradox, accidental collisions with hash
functions are more common than the astronomical numbers like 2**128 and
2**160 seem to suggest; 50% chance with roughly 1.18 times the square
root of the number of possible hash values, assuming the hash values are
distributed evenly.

That works out to a 50% chance of collision after around 2**64 (MD5) or
2**80 (SHA-1).

2**64 and 2**80 are still large numbers, unlikely to be met in practice
where file comparison is the goal of hashing.
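
The birthday figures can be reproduced with a few lines of arithmetic. The sketch below assumes evenly distributed hash values and uses the standard approximation n ≈ sqrt(2 · ln 2 · N) for a 50% chance, which puts the constant at about 1.18; the function names are made up for this example.

```python
import math

def draws_for_half_chance(bits: int) -> float:
    """Approximate number of random hashes needed for a 50% collision chance.

    Uses n ~ sqrt(2 * ln 2 * N), where N = 2**bits possible hash values.
    """
    return math.sqrt(2 * math.log(2)) * 2.0 ** (bits / 2)

def collision_probability(n: float, bits: int) -> float:
    """Approximate chance of at least one collision among n random hashes."""
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n * (n - 1) / (2 * 2.0 ** bits))

md5_half = draws_for_half_chance(128)    # slightly above 2**64
sha1_half = draws_for_half_chance(160)   # slightly above 2**80
billion_files = collision_probability(1e9, 128)
```

Even a billion files hashed to 128 bits leaves `billion_files` vanishingly small, which is the point about file comparison being safe from accidental collisions.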

Of course, specially crafted collisions have been found for MD5, and
attacks are underway with 2**35 evaluations for SHA-1. But these won't
be of concern for file comparison.

-- Barry

Barry Kelly · May 29, 2008

Johnny said:
As for your comments on my posting technique:

1) The question is NOT crossposted. It would have been if I posted TWO
separate messages to which people responded individually.

That is called multi-posting. Your message was indeed cross-posted.

2) Are you a programmer at all? How can you reason that a post that's
relevant in a C# group cannot possibly be relevant in a VB.NET group?

The question was a general cryptography one, and indeed could be argued
wasn't relevant to any language-specific group. .framework might have
been closest, had it been phrased as recommended use of
System.Security.Cryptography.

-- Barry

Cor Ligthert [MVP] · May 29, 2008

In my view, that would mean you could recreate the file from the hash code.

Cor

Jon Skeet [C# MVP] · May 29, 2008

3) To everybody who answered that my first two questions were identical:
They're not - it depends on the answer.

No it doesn't. They're logically equivalent.

Suppose the answer to question 2 was "yes" - i.e. there can never be
two different files with the same hash value. That implies that if two
files have the same hash value, they can't be different, i.e. they are
identical. Hence the answer to question 1 would have to be "yes" as
well.

To put it in pure logic terms, if A is the predicate "X and Y have
identical hashes" and B is the predicate "X and Y are identical
files", your questions were:

1) A => B
2) !B => !A

I hope you can see how these are logically equivalent questions.
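
For anyone who prefers to check this mechanically, here is a small truth-table sketch (the helper `implies` is defined just for this example). It compares A => B with !B => !A for every combination of truth values.

```python
from itertools import product

def implies(p: bool, q: bool) -> bool:
    """Material implication: p => q is false only when p is true and q is false."""
    return (not p) or q

# Compare "A => B" with its contrapositive "!B => !A" for all four cases.
equivalent = all(
    implies(a, b) == implies(not b, not a)
    for a, b in product([False, True], repeat=2)
)
```

`equivalent` comes out true, which is the contrapositive rule: the two questions always have the same answer.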

Jon

Joergen Bech · May 29, 2008

Well, if you could create the file from the hash code, then it really
wouldn't be hashing, would it?

Then it would be cryptography, in which case there would be no
point for the purposes of what the OP is trying to achieve.

Regards,

Joergen Bech
