about Equals and GetHashCode

Tony Johansson · Jun 12, 2008

Hello!

I can't figure out what point it is to use GetHashCode. I know that this
GetHashCode is used for obtaining a unique integer
value.
Can somebody give me an example that prove the usefulness of this
GetHashCode or it it of no use at all?

A mean that if I get an integer from current time in some way what should I
use it for?

One more question should these Equals and GetHashCode be used in pair?

//Tony

Peter Morris · Jun 12, 2008

If you override GetHashCode you should also override Equals, to ensure that
when two objects are equal their hashcodes are the same as well. As for an
example of its use, it is used in Hashtable and Dictionary<,>

--
Pete
=========================================
I use Enterprise Core Objects (Domain driven design)
http://www.capableobjects.com/
=========================================

Marc Gravell · Jun 12, 2008

No; GetHashCode() does not guarantee uniqueness - for starters, it is
an Int32 - it would take a trivial example involving Int64 and i++ to
show that they aren't unique.

The point is that if a.GetHashCode() and b.GetHashCode() are
*different* then a nd b definitely aren't the same. If the hashcodes
are the same, then you need to check a and b. This provides an
efficient way of narrowing down the amount of things we actually need
to check, by first excluding everything that doesn't have the same
hashcode,
leaving a much smaller set.
Further, advanced structures such as hash-tables can use the hash-code
to create "buckets" - so, for example, based on the number of items,
it might decide to split it into 20 "buckets" by taking the modulo of
the hashcode - i.e. each item goes into the bucket with index
GetHashCode() % bucketCount.

Then, to find a given item, it can do:
* get the haschode of the search key
* calculate the modulo of that hashcode - we now know which bucket to
look in [down to 1/20th of the data already]
* look in that bucket for everything with the correct hashcode
[probably down to 1 or 2 items now]
* perform the actual Equals between the search key and the stored
items, to find the one we want

So: GetHashCode() is very rarely used without Equals() [although
Equals() can be used without GetHashCode()]; it is definitely very
important, and underpins all the fast lookups from Dictionary<,>, etc.
Linear search (testing each item) would be very, very slow on a long
list.

Marc

Marc Gravell · Jun 12, 2008

(for info, Dictionary<,> and HashTable actually use a more
sophisticated algorithm - but that gives an overall picture of how a
hashcode might be used in theory)

Tony Johansson · Jun 12, 2008

Hello!

When I override the Equals method I should also overide the GetHashCode
according to others and all documentaion that I have read but why?
I mean in the Equals method I implement what I mean to say that these two
object are equals I can't see
any point what the GetHashCode can add to my implementation of the Equals
method?

I mean if I for example have a Person with an integer of age in this class
Person.
I just check if the ages in these two object are equal if so then return
true else false from the Equals method.
But here if I also override the GetHashCode what can this method add to this
checking if the two ages in my two
objects are equal or not?

//Tony

Jon Skeet [C# MVP] · Jun 12, 2008

When I override the Equals method I should also overide the GetHashCode
according to others and all documentaion that I have read but why?
I mean in the Equals method I implement what I mean to say that these two
object are equals I can't see
any point what the GetHashCode can add to my implementation of the Equals
method?

Two objects which are considered equal should return the same hash
code. If they do, that type can be used in a hashtable (such as
Dictionary). If they don't, a hash lookup will (potentially) fail for
equal keys, which violates the point of a hashtable.

I mean if I for example have a Person with an integer of age in this class
Person.
I just check if the ages in these two object are equal if so then return
true else false from the Equals method.
But here if I also override the GetHashCode what can this method add to this
checking if the two ages in my two
objects are equal or not?

If Equals is just checking the person's age, then returning the age as
the hash code is a reasonable start.

Generally you build a hash up from the values which are compared for
equality.

Jon

Tony Johansson · Jun 12, 2008

Very good answer Jon
short and consist exact what I wanted.

Many thanks!!

//Tony

Tony Johansson · Jun 12, 2008

Hello!

One more thing how will this GetHashCode be called I assume that it's not
called explicitly like someObject.GetHasCode();

//Tony

Jon Skeet [C# MVP] · Jun 12, 2008

One more thing how will this GetHashCode be called I assume that it's not
called explicitly like someObject.GetHasCode();

It certainly could be. It's just a normal method. It will usually be
called by either:
a) an implementation of a hashtable, such as Dictionary
b) another object which contains an instance of your type, is using
that as part of equality comparison, and wants to use it to build its
own hashcode.

As an example of b) if your person had a name as well as an age, you
might do:

public override int GetHashCode()
{
int ret = 17;
ret = 37*ret + age;
ret = 37*ret + name.GetHashCode();
return ret;
}

Jon

Brian Gideon · Jun 12, 2008

It certainly could be. It's just a normal method. It will usually be
called by either:
a) an implementation of a hashtable, such as Dictionary
b) another object which contains an instance of your type, is using
that as part of equality comparison, and wants to use it to build its
own hashcode.

As an example of b) if your person had a name as well as an age, you
might do:

public override int GetHashCode()
{
int ret = 17;
ret = 37*ret + age;
ret = 37*ret + name.GetHashCode();
return ret;

}

Jon

Maybe it's just a typo, but did you mean to use age as-is or did you
mean to call GetHashCode on it too? The reason I'm asking is because
the way it is written would cause people with the same name to cluster
together. The effect would be more severe had age been merged into
the hash code after name.

Also, in the interest of playing the Monday morning quarterback, I've
wondered whether Microsoft would have been better off forcing all
dictionary keys to implement a special interface, say IDictionaryKey,
that inherits IEquatable and provides the GetHashCode method. That
way the requirement for implementing GetHashCode and Equals in tandem
could be enforced declaratively instead of having a special case the
compiler must check for. Opinion?

Peter Morris · Jun 12, 2008

Maybe it's just a typo, but did you mean to use age as-is or did you

mean to call GetHashCode on it too?

GetHashCode returns an Int32. int == Int32. Therefore the value of an int
is already as unique as you can represent given 4 bytes. Therefore:

public struct Int32
{
...
public override int GetHashCode()
{
return this;
}
...
}

--
Pete
=========================================
I use Enterprise Core Objects (Domain driven design)
http://www.capableobjects.com/
=========================================

Brian Gideon · Jun 12, 2008

I'm not sure what you mean about "cluster". The general rule is that a
type that's already effectively an int can just be used directly as its
own hash code. I'd be surprised if int.GetHashCode() does anything except
return the int value anyway, but regardless, there's no reason not to just
use the int directly as its own hash code.

You're right Int32 returns a hash code that is equal to the value so
it doesn't matter at all if GetHashCode is called. I'm using the term
clustering to describe the situation that arises when a set of objects
yield hash codes that tend to bunch together within a smaller subset
of the entire Int32 range because of similarities in there
attributes. Clustering causes inefficiency in a lot hash tables
implementations.

In the example people with the same name, but different ages would
yield hash codes that would cluster around around a value dominated by
the name and spread out over range of no more than 3700 (since 100 is
a reasonable life expectancy). The effect is that the hashtable may
become no better than a linked list if it contains people with the
same name.

I _do_ believe that pre-initializing the hashcode with 629 is pointless.
You could start at 0 just as easily, and the values would mathematically
be distributed the same, I believe. At least, the last time I brought
that up (in the Java newsgroup) no one provided any counter-proof, and the
only reponses were in agreement with my statement. It could be that just
means they are all as dumb as me, but I don't think there's an actual
mathematical difference.

Don't know.

I'm also not sure what you mean here. Equals and GetHashCode are defined
in Object. All types have a default implementation (for reference types,
the default implementation is simply reference equality), so the compiler
doesn't need to check for them at all. The default implementation may or
may not be what's desired with respect to a dictionary key, but that's not
something the compiler can figure out anyway. There's no "special case
the compiler must check for".

So, no...I don't see the value in having an IDictionaryKey interface.

The compiler has to be handling this as a special case otherwise you
wouldn't get the warning that "...overrides Object.Equals but does not
override Object.GetHashCode". An interface based approach would
eliminate this special case and be able to strictly enforce this
relationship between GetHashCode and Equals. Not mention that it
would encourage developers to put more thought into using custom
objects as dictionary keys (which they probably should be anyway).

Marc Gravell · Jun 12, 2008

Not mention that it

would encourage developers to put more thought into using custom
objects as dictionary keys (which they probably should be anyway).

My only problem with that is that people tend to forget that the
hashcode should be immutable, at least once it has been used as a key
- otherwise it can become unfindable...

Re the clustering - the most common mistake people make here would be
using xor - any matched pairs come out at zero, making for a great big
collision. I don't expect I'm telling anybody involved in this
dicsussion anything they don't know, but for the list's benefit ;-p

Marc

Jon Skeet [C# MVP] · Jun 12, 2008

Brian Gideon said:
Maybe it's just a typo, but did you mean to use age as-is or did you
mean to call GetHashCode on it too? The reason I'm asking is because
the way it is written would cause people with the same name to cluster
together. The effect would be more severe had age been merged into
the hash code after name.

I'd probably just leave it at that. Calling GetHashCode on age wouldn't
have don't anything, as Int32's implementation of GetHashCode is just
to return the original value.

I'd hope that a good hashtable implementation would effectively ignore
the "closeness" of hashcodes to some extent, for instance by bucketing
with some appropriate mod function.

Also, in the interest of playing the Monday morning quarterback, I've
wondered whether Microsoft would have been better off forcing all
dictionary keys to implement a special interface, say IDictionaryKey,
that inherits IEquatable and provides the GetHashCode method. That
way the requirement for implementing GetHashCode and Equals in tandem
could be enforced declaratively instead of having a special case the
compiler must check for. Opinion?

Agreed. Ironically I was thinking about blogging precisely this point,
although for slightly different reasons. Many types - most types, even
- aren't really suitable as hash keys unless you're actually talking
about identity. That can still be useful, but it's probably worth
knowing what you're talking about.

So long as you made an implementation of IEqualityComparer available
which took the "native" implementation of the current
object.GetHashCode available to cope with the identity part, I think it
would be entirely reasonable to not have GetHashCode and Equals as part
of object itself. I think of it as a mistake that Java made and then
..NET copied.

(There may have been platforms before Java which had a single type
hierarchy where the uber-type had hashcode and equality defined - I
just don't know of it.)

Jon Skeet [C# MVP] · Jun 12, 2008

Peter Duniho said:
I _do_ believe that pre-initializing the hashcode with 629 is pointless.

It's pointless for all but purposes of symmetry. It means that *every*
field, from first to last, is treated the same way: take "current
value", multiply by constant, add hash for field.

It means that while you're deciding equality / hashcode implementation,
you can add and remove fields at will. No particular reason beyond
that.

Jon Skeet [C# MVP] · Jun 12, 2008

Brian Gideon said:
You're right Int32 returns a hash code that is equal to the value so
it doesn't matter at all if GetHashCode is called. I'm using the term
clustering to describe the situation that arises when a set of objects
yield hash codes that tend to bunch together within a smaller subset
of the entire Int32 range because of similarities in there
attributes. Clustering causes inefficiency in a lot hash tables
implementations.

I *hope* that the .NET implementations don't suffer from that problem.

In the example people with the same name, but different ages would
yield hash codes that would cluster around around a value dominated by
the name and spread out over range of no more than 3700 (since 100 is
a reasonable life expectancy). The effect is that the hashtable may
become no better than a linked list if it contains people with the
same name.

Well, even if it has a linked list of people with the same name it can
still get rid of definite non-matches (same name, different age)
without doing a single comparison of actual people.

Brian Gideon · Jun 13, 2008

I'd probably just leave it at that. Calling GetHashCode on age wouldn't
have don't anything, as Int32's implementation of GetHashCode is just
to return the original value.

Yeah, after realizing Int32.GetHashCode just returns the same value
I'd leave it as-is as well. It's too much effort trying to think of a
better distribution especially considering that people with same name
(first and last anyway) doesn't occur frequently enough.

Ya know, my naive line of thinking was that Int32.GetHashCode would
return something other than the original value so that the
distribution would be completely random. As it is now the
distribution can definitely be considered good since the hash code has
the same entropy as the original value, but it's not at all random.
To be perfectly honest, I'm not sure how important it is for the hash
code to be absolutely random as opposed to just good.

I'd hope that a good hashtable implementation would effectively ignore
the "closeness" of hashcodes to some extent, for instance by bucketing
with some appropriate mod function.

The documentation does mention that for best performance the
distribution of GetHashCode should be random. But, I'm sure that
Hashtable and Dictionary both have good implementations and that the
clustering around the name in your example is probably negligible. I
brought it up because I thought it would stimulate some interesting
discussions at the very least.

Several years ago I looked at the shared source code for Hashtable and
I believe it was using the guadratic probing method. It's been awhile
since my data structures class so I'm a little rusty, but I believe
that method does offer some protection from the clustering problem.
Anyone know what method Dictionary uses? I thought it was different
from Hashtable.

Agreed. Ironically I was thinking about blogging precisely this point,
although for slightly different reasons. Many types - most types, even
- aren't really suitable as hash keys unless you're actually talking
about identity. That can still be useful, but it's probably worth
knowing what you're talking about.

So long as you made an implementation of IEqualityComparer available
which took the "native" implementation of the current
object.GetHashCode available to cope with the identity part, I think it
would be entirely reasonable to not have GetHashCode and Equals as part
of object itself. I think of it as a mistake that Java made and then
.NET copied.

(There may have been platforms before Java which had a single type
hierarchy where the uber-type had hashcode and equality defined - I
just don't know of it.)

--

I think that would be an interesting topic for a blog. I do try to
read your blog posts so I'll catch it if you do decide to post on the
topic.

Brian Gideon · Jun 13, 2008

My only problem with that is that people tend to forget that the
hashcode should be immutable, at least once it has been used as a key
- otherwise it can become unfindable...

That's seems so obvious, but I could actually see myself making that
very mistake. You've got to admit that such a simple concept could be
easily overlooked.

Re the clustering - the most common mistake people make here would be
using xor - any matched pairs come out at zero, making for a great big
collision. I don't expect I'm telling anybody involved in this
dicsussion anything they don't know, but for the list's benefit ;-p

Yep, I'm aware of the problems in using xor, but I bet most developers
aren't. I don't override GetHashCode very often, but when I do I
sometimes use xor because it doesn't require any thought, but I always
put a TODO comment in reminding myself that I need to come back and
change it.

Brian Gideon · Jun 13, 2008

You could still multiply the 0 by 37 in the first field incorporated,
preserving the symmetry aspect. I just don't see the point in starting
with 17 versus 0. It has the sense of "magic" about it, but unlike the
multiplication by 37, I don't see any practical advantage. A less
"magical" initializer of 0 would be just as good.

I think.

Pete

It does seem like magic doesn't it? But, then again mathematics is
full intriguing quirks so it wouldn't surprise me if there was a valid
reason.

It reminds of me of the section in Knuth Vol 2 where he talks about
random number generators and warns that using one random number
generator to seed another generator may actually make the end result
*less* random! I know...completely off topic. My mind works in
mysterious ways

Jon Skeet [C# MVP] · Jun 13, 2008

No, my point is that you could just initialize the local hash variable to
0, rather than 17.

You could still multiply the 0 by 37 in the first field incorporated,
preserving the symmetry aspect. I just don't see the point in starting
with 17 versus 0. It has the sense of "magic" about it, but unlike the
multiplication by 37, I don't see any practical advantage. A less
"magical" initializer of 0 would be just as good.

I think.

Okay, I'll have to consult Effective Java, which is where the advice
originally came from. I believe there's some benefit to having two
magic numbers which are coprime to each other. Wikipedia doesn't show
this as one of the example hash functions, unfortunately, so there's
no discussion there to help

Jon

about Equals and GetHashCode

Tony Johansson

Peter Morris

Marc Gravell

Marc Gravell

Tony Johansson

Jon Skeet [C# MVP]

Tony Johansson

Tony Johansson

Jon Skeet [C# MVP]

Brian Gideon

Peter Morris

Brian Gideon

Marc Gravell

Jon Skeet [C# MVP]

Jon Skeet [C# MVP]

Jon Skeet [C# MVP]

Brian Gideon

Brian Gideon

Brian Gideon

Jon Skeet [C# MVP]