Hashing: "blair" == "brainlessness" !!!!

F

Francesc Benavent

I have some problem with a hashtable keyed by strings due to some
object collisions.

I've read from sources like:

http://groups.google.com/groups?hl=...befb&seekm=eOuZ99ctBHA.1272@tkmsftngp03#link1

that collisions in string hashes are negligible, but testing with an
english dictionary with 350.000 words the hastbale produced 64
collisions. It can seem a few but I think that 1 collision in 5000
words are too!!

The smallest code to show my problem is next:

Console.WriteLine( "{0} = {1}",
"blair".GetHashCode(),
"brainlessness".GetHashCode()
);

The Console outputs:
175803953 = 175803953

How is possible that in a space so huge like a Int32 (4 billions)
there were collision in 1 of 5000 cases? I've trying other no standar
hashing functions but all them worked worst than GetHashCode()...

I hope that I were missing some important concept. Somebody has tried
to hashing a big dictionary in a hashtable? Somebody has any clue
about what I'm doing bad?

thanks very much,

Francesc
 
J

Jochen Kalmbach

Francesc said:
I have some problem with a hashtable keyed by strings due to some
object collisions.
[...]
I hope that I were missing some important concept.

Collisions are normal in hash-tables!

I've read from sources like:

http://groups.google.com/groups?hl=ca&lr=&ie=UTF-8&oe=UTF-8&safe=off&fr
ame=right&th=5e5cc678d76cbefb&seekm=eOuZ99ctBHA.1272%40tkmsftngp03#link
1

that collisions in string hashes are negligible,

Where can you read this !?
<snip>
Collisions are unavoidable in hashtables.
</snip>



By the way: What is your problem ?

--
Greetings
Jochen

Do you need a memory-leak finder ?
http://www.codeproject.com/tools/leakfinder.asp


Do you need daily reports from your server ?
http://sourceforge.net/projects/srvreport/
 
J

Jon Skeet [C# MVP]

Francesc Benavent said:
How is possible that in a space so huge like a Int32 (4 billions)
there were collision in 1 of 5000 cases? I've trying other no standar
hashing functions but all them worked worst than GetHashCode()...

I don't see that it's really a problem, to be honest. A hash collision
just means that two objects need to have Equals called on them when
doing a lookup, rather than one. Collisions are part of life in
hashing. It would be *nice* if they didn't occur, but everything is
designed around them happening, just not terribly often. 1 in 5000 is
fine. In your dictionary, what was the greatest number of string which
ever produced the same hashcode? How expensive do you think it would be
to call Equals on all of them, bearing in mind that unless they're the
same length, Equals doesn't even need to look at the data itself?
I hope that I were missing some important concept. Somebody has tried
to hashing a big dictionary in a hashtable? Somebody has any clue
about what I'm doing bad?

Nothing - it's just not a problem.
 
F

Francesc Benavent

Hi Jochem,
Collisions are normal in hash-tables!

Ok, I understand it now. I thought that they were theorically possible
but statistically not likely.
Where can you read this !?
<snip>Collisions are unavoidable in hashtables.</snip>

Biased reading, I would select to read:

<snip>I did a test and did not get a collision for 30 million unique
strings</snip>

And for this I wondered that with a test a hundred times smaller I got
so "many" collisions.
By the way: What is your problem ?

The problem is that I cant index strings by its hash code, due to
collisions. Please read my next message in this thread for a
description of my real problem.

thanks,

Francesc
 
F

Francesc Benavent

Hi Jon,
It would be *nice* if they didn't occur, but everything is
designed around them happening, just not terribly often.

Yes, I'm starting to know about it... :)
what was the greatest number of string which ever produced
the same hashcode?

All of the 64 collision were just double. No hash code were shared by
3 or more strings.
just means that two objects need to have Equals called on them when
doing a lookup, rather than one.

How expensive do you think it would be to call Equals on all of them,
bearing in mind that unless they're the same length, Equals doesn't even
need to look at the data itself?

I'm not sure to understand those paragrpahs, sorry. Can you explain it
a litle more?

Nothing - it's just not a problem.

Same to you ;-), please read my next message with description of
specific problem.

thanks for your message,

Francesc
 
J

Jon Skeet [C# MVP]

Francesc Benavent said:
Yes, I'm starting to know about it... :)
:)


All of the 64 collision were just double. No hash code were shared by
3 or more strings.
Great.



I'm not sure to understand those paragrpahs, sorry. Can you explain it
a litle more?

Sure. Hashtables generally work roughly in this way:

The table has a number of buckets. Each bucket has a portion of the
hashspace. (Simple example: you might have two buckets, with one bucket
having all negative hashcodes, another having all the non-negative
ones.) This is done so you can easily find a list of items with
"roughly" the right hashcode.

In each bucket are entries. The entry contains the actual hashcode and
a reference to the key object.

When looking for something, you first find the right bucket. You then
look down the list of all the entries (which are usually ordered) and
find the entries with the right hashcode. For each such entry, you call
Equals on the key you're trying to find, passing in the the reference
to the potentially matching key (or the other way round - it shouldn't
matter).

So, having hash collisions slows things down a bit, because you need to
call Equals more times in order to find the right key. It doesn't slow
it down a *lot* though, because you still only have to call Equals on a
few items (at most two in your case, according to the bit you wrote
earlier).
Same to you ;-), please read my next message with description of
specific problem.

I haven't seen that message yet... I'll comment when I do.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top