Generating a Hash


Stan

Is it possible to hash a 100-byte string to an integer? I found a few .NET
classes for that, such as SHA1Managed.ComputeHash, but they return bytes.
I am just not sure about the idea of converting 100 bytes to four or eight
without losing uniqueness.

The issue has come up because I am storing bills with customers in a database
and I would like to reuse customers, so that not every bill has its own
customer. In order to do that I need to make some sort of unique code for
each customer based on name, address, city, state, zip. I want to use the
whole customer name, because very often there are customers in the same city
whose names differ only in the last few characters.

Thanks,

-Stan
 

Jon Skeet

Stan said:
Is it possible to hash a 100-byte string to an integer? I found a few .NET
classes for that, such as SHA1Managed.ComputeHash, but they return bytes.

Sure, but you can convert 4 bytes into an integer or 8 bytes into a
long pretty easily.
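
For illustration, here is a minimal sketch of that conversion. It is written in Python rather than .NET, and the function names are made up for the example; the idea is simply to take the first 4 or 8 bytes of the SHA-1 digest and read them as a signed integer.

```python
import hashlib


def hash_to_int32(text: str) -> int:
    """Read the first 4 bytes of the SHA-1 digest as a signed 32-bit integer."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()  # 20 bytes in total
    return int.from_bytes(digest[:4], byteorder="big", signed=True)


def hash_to_int64(text: str) -> int:
    """Read the first 8 bytes of the SHA-1 digest as a signed 64-bit integer."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)
```

The same input always gives the same integer, but two different inputs can give the same integer too, since most of the digest is thrown away.
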

I am just not sure about the idea of converting 100 bytes to four or
eight without losing uniqueness.

Well obviously you can't do that - there are far more sequences of 100
bytes than there are of 4 or 8.
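
To make the counting argument concrete, the sketch below (Python, with hypothetical helper names and made-up "customer-N" strings) truncates SHA-1 to just 2 bytes so a collision turns up quickly. With only 65,536 possible values, trying 65,537 distinct inputs is guaranteed to produce a repeat; 4 or 8 bytes only push that point further out.

```python
import hashlib
from typing import Dict, Tuple


def truncated_hash(text: str, num_bytes: int = 2) -> bytes:
    """Keep only the first num_bytes bytes of the SHA-1 digest."""
    return hashlib.sha1(text.encode("utf-8")).digest()[:num_bytes]


def find_collision(num_bytes: int = 2) -> Tuple[str, str]:
    """Return two different strings whose truncated hashes are equal."""
    seen: Dict[bytes, str] = {}
    i = 0
    while True:
        candidate = f"customer-{i}"  # made-up input, one per iteration
        h = truncated_hash(candidate, num_bytes)
        if h in seen:
            return seen[h], candidate
        seen[h] = candidate
        i += 1
```
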

The issue has come up because I am storing bills with customers in a database
and I would like to reuse customers, so that not every bill has its own
customer. In order to do that I need to make some sort of unique code for
each customer based on name, address, city, state, zip. I want to use the
whole customer name, because very often there are customers in the same city
whose names differ only in the last few characters.

Rather than assume the hash code itself is identical, just assign each
customer a unique ID and look it up based on name, address, city, state
and zip when you need to retrieve it.
 

Stan

Rather than assume the hash code itself is identical, just assign each
customer a unique ID and look it up based on name, address, city, state
and zip when you need to retrieve it.

Then I will have this query:

select * ..... from ... where name = @name and address = @address
and city = @city and state = @state and zip = @zip

It is by far more efficient to have

select * ..... from ... where code = @code
 

Guinness Mann

select * ..... from ... where name = @name and address = @address
and city = @city and state = @state and zip = @zip

It is by far more efficient to have

select * ..... from ... where code = @code

I don't think you quite understand what a hash is, Stan. Hashes are not
guaranteed to be unique. They're just a way of localizing sparse data.
You *always* have to check for collisions with a hash.

As Jon mentioned, how could you possibly generate unique 8 (or 4) byte
values for each possible value of a 100-byte string? Think about it.

Why not look up "hashing with linear probing" to see a possible solution
for your problem?

-- Rick
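
As a rough idea of what "hashing with linear probing" looks like, here is a small in-memory sketch in Python. The class and method names are invented for the example, and it leaves out resizing and deletion. The point to notice is that the hash only picks a starting slot; the stored key is always compared before a match is accepted.

```python
from typing import Iterator, List, Optional, Tuple


class LinearProbingTable:
    """Fixed-size hash table that resolves collisions by linear probing."""

    def __init__(self, capacity: int = 8) -> None:
        self._slots: List[Optional[Tuple[str, int]]] = [None] * capacity

    def _probe(self, key: str) -> Iterator[int]:
        # Start at the slot chosen by the hash, then walk forward, wrapping around.
        start = hash(key) % len(self._slots)
        for offset in range(len(self._slots)):
            yield (start + offset) % len(self._slots)

    def put(self, key: str, value: int) -> None:
        for index in self._probe(key):
            slot = self._slots[index]
            # Use the first empty slot, or overwrite the slot already holding this key.
            if slot is None or slot[0] == key:
                self._slots[index] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key: str) -> Optional[int]:
        for index in self._probe(key):
            slot = self._slots[index]
            if slot is None:
                return None  # an empty slot ends the probe sequence
            if slot[0] == key:  # compare the real key, not just the hash
                return slot[1]
        return None
```
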
 

Jon Skeet

Stan said:
Then I will have this query:

select * ..... from ... where name = @name and address = @address
and city = @city and state = @state and zip = @zip

It is by far more efficient to have

select * ..... from ... where code = @code

Sure - if you don't mind the fact that your code won't necessarily be
unique...

Of course, it's *unlikely* that you'll get a hash collision, if you
only have a few thousand entries - but that may not be good enough.
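
To put a rough number on "unlikely", the usual birthday approximation can be sketched as below (Python; the function name is made up, and it assumes the hash values are spread uniformly, which a truncated SHA-1 should come close to).

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision among num_items hashes.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * 2**bits)),
    assuming uniformly distributed hash values.
    """
    space = 2.0 ** hash_bits
    exponent = -num_items * (num_items - 1) / (2.0 * space)
    return -math.expm1(exponent)  # equals 1 - exp(exponent), accurate for tiny values
```

Under those assumptions, 5,000 customers with a 32-bit code come out at roughly a 0.3% chance of at least one collision, while a 64-bit code makes it vanishingly small. Neither is zero.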

(What you could do is search by hash and then verify each field
separately, of course.)
 

Stan

I don't think you quite understand what a hash is, Stan. Hashes are not
guaranteed to be unique. They're just a way of localizing sparse data.
You *always* have to check for collisions with a hash.

Yes, I thought a hash was guaranteed to be unique, similar to how NT encrypts
users' passwords and stores them as hashes...

What I probably need is not hashing but compressing or compacting
name+address+city+state+zip. Even without spaces I end up with 100-150
characters... There have got to be some algorithms that do that (similar to
ZIP, ARJ, etc.)...
 

Jon Skeet

Stan said:
Yes, I thought a hash was guaranteed to be unique, similar to how NT encrypts
users' passwords and stores them as hashes...

That doesn't guarantee it to be unique, I rather suspect. One-way
hashes like that are basically used so that an attacker has a *very,
very small* chance of getting access without having the right password,
and the password itself doesn't need to be stored in plain text.

What I probably need is not hashing but compressing or compacting
name+address+city+state+zip. Even without spaces I end up with 100-150
characters... There have got to be some algorithms that do that (similar to
ZIP, ARJ, etc.)...

Well, hashing would be a good start, if you wanted something small to
search on: write a hash into your database (and make sure it's up to
date!) but having retrieved results by hashcode, check that you get the
right record (by the individual fields) before doing anything else.
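
A minimal sketch of that "search by hash, then verify the fields" approach, using Python with an in-memory SQLite database. The table layout, column names and helper functions are all assumptions made for the example, not anything from the original schema.

```python
import hashlib
import sqlite3
from typing import Tuple

Customer = Tuple[str, str, str, str, str]  # name, address, city, state, zip


def customer_code(customer: Customer) -> int:
    """64-bit code from the joined fields; NOT guaranteed to be unique."""
    joined = "\x1f".join(customer)  # unit separator keeps field boundaries distinct
    digest = hashlib.sha1(joined.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, code INTEGER NOT NULL,"
        " name TEXT, address TEXT, city TEXT, state TEXT, zip TEXT)"
    )
    # Non-unique index: the code narrows the search, it does not identify a row.
    conn.execute("CREATE INDEX idx_customers_code ON customers (code)")
    return conn


def find_or_create(conn: sqlite3.Connection, customer: Customer) -> int:
    """Return the id of the matching customer, inserting a new row if needed."""
    code = customer_code(customer)
    rows = conn.execute(
        "SELECT id, name, address, city, state, zip FROM customers WHERE code = ?",
        (code,),
    ).fetchall()
    for row in rows:
        if tuple(row[1:]) == customer:  # verify every field, not just the code
            return row[0]
    cursor = conn.execute(
        "INSERT INTO customers (code, name, address, city, state, zip)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        (code, *customer),
    )
    return cursor.lastrowid
```

The indexed lookup on `code` stays cheap, and the field-by-field comparison only runs on the handful of rows (usually one) that share the code.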

Note that although compression algorithms like zip etc will *usually*
save space, there's no guarantee that they will - and there *can't* be,
for exactly the same reason you can't get a unique hash when you're
going from x bytes to y bytes and y is smaller than x.
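
A quick way to see this for yourself, sketched in Python with zlib (the sample address and the seeded pseudo-random bytes are made up for the demonstration): repetitive text shrinks, while data with no pattern comes out larger than it went in, because the format adds its own header and checksum.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed form at the highest compression level."""
    return len(zlib.compress(data, 9))


# Highly repetitive input: the same made-up address four times over.
repetitive = b"123 Main Street, Springfield, IL 62704; " * 4

# Patternless input of the same length, from a seeded generator so runs are repeatable.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(len(repetitive)))

print(len(repetitive), compressed_size(repetitive))  # compressed is smaller
print(len(noisy), compressed_size(noisy))            # compressed is larger
```
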
 
