Hashing article title, will it always be unique?

D

DotNetNewbie

Hi,

I need to hash the Title of my articles, will the hashed value ALWAYS
be unique so long as the Title is unique?

Example:

string title = "microsfoft csharp news group";

int hash = title.GetHashCode();
 
R

Richard Blewett

DotNetNewbie said:
Hi,

I need to hash the Title of my articles, will the hashed value ALWAYS
be unique so long as the Title is unique?

Example:

string title = "microsfoft csharp news group";

int hash = title.GetHashCode();

No - a Hash is guaranteed to be the same for the same input but is not
guaranteed unique.

Think about it this way GetHashCode returns int32 - there are more than 4
billion possible strings

--
Regards

Richard Blewett
DevelopMentor
http://www.dotnetconsult.co.uk/weblog2
 
O

Oddball

Hi,

I need to hash the Title of my articles, will the hashed value ALWAYS
be unique so long as the Title is unique?

Example:

string title = "microsfoft csharp news group";

int hash = title.GetHashCode();

Hashes just need a wide phase space - low probability of collision.
Not unique - just very very very very unlikly to collide.

Now someone corrects me :D
 
N

Nicholas Paldino [.NET/C# MVP]

There are two things to take into consideration when doing this. The
first is the size of the hash. If the size of the hash can account for all
the values that are allowed for a certain set of values, then the hash ^may^
be unique.

The second part is the algorithm. It's completely possible that the
algorithm could produce equal hash values for separate values even when the
number of values will fit within the entire range of the values that the
hash is represented in. Of course, one might argue it's not a very ^good^
algorithm, but it's possible.

A good example is the Boolean data type and using a hash code returned
by GetHashCode. Now, the Boolean data type has two possible values, while
the Int type (which is what GetHashCode returns) has 2^32 possible values.
Given this, a hash code for the Boolean type ^should^ be unique. However a
hashing algorithm is under no obligation required to return different hash
codes for different values in the boolean data type. It's very possible
that GetHashCode can return 0 for both true and false.

So the answer is no. There are other algorithms that will produce
longer hashes (SHA I think will generate a 512 bit hash) but the algorithm
is not required to produce different hash values for different values passed
into it, even if the number of values in the hash value output type can
represent all possible values in the input type.
 
M

Michel Walsh

With a table from a database (MS SQL Server). Your table has two fields, the
primary key, an autonumber automatically generated, and the second field, a
nvarchar(255) with an index unique but with ignore_dup_key on.

Insert the new title in the table.

INSERT INTO hashTable(secondFieldName) VALUES('microsoft csharp news
group')


If the title does not exist, it will be appended. If the title exists, it
will be rejected (unique constraint). You can also insert a batch of titles:

INSERT INTO hashTable(secondFieldName) SELECT titles FROM otherTable


and the new titles will be appended, while the existing ones will be
rejected (that is because of the ignore_dup_key set to on; if set to false,
all the 'batch' would be rejected).


Next, to get the hash number, make a 'lookup' over the title, returning the
primary key.


SELECT primaryKey FROM hashTable WHERE secondFieldName='microsoft csharp
news group'


That returns the long integer value that satisfy your definition, even if it
is not a hash value 'per se'. Since the second field is also indexed,
finding the hash value should not be really too slow, even if the hashTable
is large. If two titles are the same, they will return the same 'hash
value', since only one sample of the title would have been saved in the
table (due to the UNIQUE constraint).



Vanderghast, Access MVP.
 
R

rossum

So the answer is no. There are other algorithms that will produce
longer hashes (SHA I think will generate a 512 bit hash) but the algorithm
is not required to produce different hash values for different values passed
into it, even if the number of values in the hash value output type can
represent all possible values in the input type.
SHA is a series of cryptographic hashes. SHA256 generates a 256 bit
hash, and SHA512 generates a 512 bit hash. Both of these are a lot
slower than the standard getHash() method. SHA384 is not worth using
since it calculates the full SHA512 hash and throws part of it away.
The bigger the hash the smaller the chance of a collision.

For a slightly faster, but shorter, hash try MD5 at 128 bits. This is
no longer cryptographically secure but can still be used for
non-cryptographic purposes.

rossum
 
F

Fredo

I think others have more or less answered your question, but I think you
might get better information if you describe more precisely what it is
you're trying to do.

There are a number of different hash algorithms. I've read of various
complaints about the default implementation in string(), but I don't know
the details.

What it really comes down to is, what is it you're trying to do? From that,
we might be able to provide you with a more detailed answer and possibly
some links to further reading.
 
D

DotNetNewbie

I think others have more or less answered your question, but I think you
might get better information if you describe more precisely what it is
you're trying to do.

There are a number of different hash algorithms. I've read of various
complaints about the default implementation in string(), but I don't know
the details.

What it really comes down to is, what is it you're trying to do? From that,
we might be able to provide you with a more detailed answer and possibly
some links to further reading.

Well basically I want to put the Hash of a Article Title in the
database, so I can then fetch the article using the Hash instead of by
the Title (since the hash is an integer lookups will be faster).

My Title can be alphanumerical only plus dashes and underscores.

So I have to know if this hash will be unique as long as the Title is
unique.
 
L

Lasse Vågsæther Karlsen

DotNetNewbie said:
Well basically I want to put the Hash of a Article Title in the
database, so I can then fetch the article using the Hash instead of by
the Title (since the hash is an integer lookups will be faster).

My Title can be alphanumerical only plus dashes and underscores.

So I have to know if this hash will be unique as long as the Title is
unique.

This was answered, and the answer is no.

Re-read the other answers to get more information, but with a normal
fixed-size hash, you cannot rely on a hash being unique, even if you
happen to see only unique values for your current set of test-titles.

Use an index on the title, and it'll be fast enough.
 
J

Jon Skeet [C# MVP]

DotNetNewbie said:
Well basically I want to put the Hash of a Article Title in the
database, so I can then fetch the article using the Hash instead of by
the Title (since the hash is an integer lookups will be faster).

That's trying to second guess the database. Just insert the article
title itself, and index the column. The database will deal with it
appropriately.
My Title can be alphanumerical only plus dashes and underscores.

So I have to know if this hash will be unique as long as the Title is
unique.

Not only will it not be unique, but there's no guarantee that you'll
get the same result the next time, if you use GetHashCode(). The
algorithm of String.GetHashCode() has already changed between .NET 1.1
and 2.0 - there's no guarantee that it won't change again. It's bitten
people before who've done exactly what you've done.

If you use a fixed hash algorithm (either one you've written yourself,
or a specific one like MD5) you won't run into that issue - but you
still won't be guaranteed to get unique values.

As I said before, I suggest you index the title column and let the
database handle it.
 
N

Nicholas Paldino [.NET/C# MVP]

rossum,

The thing is, it won't satisfy the OP's issue, which is the need for the
hashcode to be unique. You can't use hashcodes to generate ^unique^ values.
A hash code generator can't guarantee that for all cases.
 
R

rossum

rossum,

The thing is, it won't satisfy the OP's issue, which is the need for the
hashcode to be unique. You can't use hashcodes to generate ^unique^ values.
A hash code generator can't guarantee that for all cases.
Agreed, any hash that is shorter than the longest possible title will
be unable to guarantee no collisions. However the longer,
cryptographic sized, hashes will tend to have fewer collisions than
shorter hashes. The problem will be reduced, but not eliminated.
This may or may not be worth the extra work done in computing a longer
hash.

rossum
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top