Hash Code Of A String

  • Thread starter: j1mb0jay

j1mb0jay

I am currently working on a dictionary-populating program. I have a
socket connection to my local news server and am trawling through all of
the articles looking for new words. I am currently using Java to do this
but would like to move the source to C#. Java's String class has a
method that hashes strings. I was wondering if C# has a method that does
the same?

In my Java version of the program I am using the multiply-add-and-divide
(MAD) method to compress the hash code. Does C# have any built-in
functions (methods) that will do this for me, or does anyone know of a
more efficient way?
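For reference, MAD compression, h = ((a*k + b) mod p) mod N, only takes a
few lines of C#. This is just a sketch: the prime p and the constants a and
b below are illustrative values, not anything prescribed.

// Sketch of MAD (multiply-add-divide) compression of a hash code.
// p is a prime larger than the table size; a and b are arbitrary
// constants with a % p != 0. All three values here are illustrative.
static int MadCompress(int hashCode, int tableSize)
{
    const long p = 2147483647;   // 2^31 - 1, a well-known prime
    const long a = 12582917;
    const long b = 8675309;
    long k = (uint)hashCode;     // reinterpret as non-negative
    return (int)(((a * k + b) % p) % tableSize);
}

// e.g. int index = MadCompress("fubar".GetHashCode(), 2500000);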

j1mb0jay
 
All .NET types have the GetHashCode() method, at minimum inherited from
System.Object.

string str = "fubar";

int hc = str.GetHashCode(); // 418978654
 
KH said:
All .NET types have the GetHashCode() method, at minimum inherited from
System.Object.

string str = "fubar";

int hc = str.GetHashCode(); // 418978654

So GetHashCode() works much the same as it does in Java. How would you
convert the 418978654 into a number that I could use as an index into a
fixed-size array?

j1mb0jay
 
That's really a whole different question; a couple options:

1) GetHashCode() returns a 32-bit integer, so if you created an array of
size UInt32.MaxValue you could use the hash code directly

2) You could % the hashcode by the size of your array
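
For concreteness, a sketch of option 2. Note that GetHashCode() returns a
signed Int32, so a plain % can produce a negative index:

// Map a (possibly negative) hash code into [0, tableSize).
// Masking off the sign bit avoids the negative-remainder pitfall
// (and the int.MinValue case that Math.Abs would throw on).
static int ToIndex(int hashCode, int tableSize)
{
    return (hashCode & 0x7FFFFFFF) % tableSize;
}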

Neither of those is actually a viable option, of course. What are you
actually trying to accomplish?
 
How much memory would option one take on a 32-bit machine?! I only have
access to about 8GB physical and 18GB virtual.

I am trying to populate a dictionary that has standard English words but
is extended with the slang and dialect found on newsgroups. This
dictionary will then aid with my coursework, which is to crack a few
simple encryption types such as polyalphabetic ciphers.

I also want to order the dictionary by the popularity of the words found
on the news server. I have successfully completed this by loading the
dictionary file into a sorted binary tree.

I just wanted to improve the insert method of the hash table: after
inserting all the words, I have to check a maximum of 4 words, rather
than 1, to see whether a word already exists in the hash table, which
holds 2.5 million words.
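
For context, an insert along these lines, using separate chaining and a
frequency count, might look like the following sketch (the class and
member names here are hypothetical, not code from the thread):

using System.Collections.Generic;

class WordTable
{
    private readonly List<KeyValuePair<string, int>>[] buckets;

    public WordTable(int size)
    {
        buckets = new List<KeyValuePair<string, int>>[size];
    }

    public void Insert(string word)
    {
        int index = (word.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        List<KeyValuePair<string, int>> bucket = buckets[index];
        if (bucket == null)
        {
            bucket = new List<KeyValuePair<string, int>>();
            buckets[index] = bucket;
        }

        // Only the few words sharing this bucket are scanned,
        // never the whole 2.5-million-word table.
        for (int i = 0; i < bucket.Count; i++)
        {
            if (bucket[i].Key == word)
            {
                bucket[i] = new KeyValuePair<string, int>(word, bucket[i].Value + 1);
                return;
            }
        }
        bucket.Add(new KeyValuePair<string, int>(word, 1));
    }
}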
 
So GetHashCode() works much the same as it does in Java. How would you
convert the 418978654 into a number that I could use as an index into a
fixed-size array?

And what would the point of that be?

.NET already has collection classes that use the hash code directly (e.g.
Dictionary, Hashtable, etc.).

You can always just use the % operator to take an arbitrary number and map
it to some arbitrarily smaller range. But I don't see the point in this
case.
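
For instance, a word-frequency count needs no hand-rolled indexing at all;
a minimal sketch:

using System.Collections.Generic;

// Dictionary<TKey,TValue> hashes, buckets and resolves collisions
// internally, so counting word frequency is short:
Dictionary<string, int> counts = new Dictionary<string, int>();

foreach (string word in new string[] { "fubar", "snafu", "fubar" })
{
    int n;
    counts.TryGetValue(word, out n);   // n is 0 if the word is new
    counts[word] = n + 1;              // fubar -> 2, snafu -> 1
}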

Pete
 
[...]
I just wanted to improve the insert method of the hash table: after
inserting all the words, I have to check a maximum of 4 words, rather
than 1, to see whether a word already exists in the hash table, which
holds 2.5 million words.

There's no way for you to guarantee that you don't have to check multiple
words, even if you have as many entries in the hash table as you have data
elements. Collisions are always a possibility.

Besides that, you seem to be pursuing a pointless optimization. Even if
your hash table required you to check dozens of words, it would still be
ridiculously faster than a linear search of 2.5 million words. That's the
point of a hash table. You may be able to improve performance slightly by
implementing your own hash table and/or hash function, but IMHO once
you've got to the point of using _some_ kind of hash table, you're likely
to find the algorithm isn't going to get a lot faster by messing around
with the hash table. Your efforts will be better spent looking for other
low-hanging fruit elsewhere in whatever algorithm you're implementing.
And if you can't get acceptable performance that way while still using a
hash table, you may wind up needing an algorithm that uses a data
structure that's even faster than the hash table.

Pete
 
Peter said:
[...]
I just wanted to improve the insert method of the hash table: after
inserting all the words, I have to check a maximum of 4 words, rather
than 1, to see whether a word already exists in the hash table, which
holds 2.5 million words.

There's no way for you to guarantee that you don't have to check
multiple words, even if you have as many entries in the hash table as
you have data elements. Collisions are always a possibility.

Besides that, you seem to be pursuing a pointless optimization. Even if
your hash table required you to check dozens of words, it would still be
ridiculously faster than a linear search of 2.5 million words. That's
the point of a hash table. You may be able to improve performance
slightly by implementing your own hash table and/or hash function, but
IMHO once you've got to the point of using _some_ kind of hash table,
you're likely to find the algorithm isn't going to get a lot faster by
messing around with the hash table. Your efforts will be better spent
looking for other low-hanging fruit elsewhere in whatever algorithm
you're implementing. And if you can't get acceptable performance that
way while still using a hash table, you may wind up needing an algorithm
that uses a data structure that's even faster than the hash table.

Pete

I am trying to do this for a data structures module for my degree;
writing our own data structures is important for the best marks. More
marks are to be gained for good performance. Of the data structures I
have studied so far, this seems to be the best one for the task. What
would you recommend as a better one?

j1mb0jay
 
I am trying to do this for a data structures module for my degree;
writing our own data structures is important for the best marks. More
marks are to be gained for good performance. Of the data structures I
have studied so far, this seems to be the best one for the task. What
would you recommend as a better one?

Caveat: I don't know the specific grading standard, how "performance" is
defined (does memory usage matter?), and frankly I don't care. I already
have mixed feelings about assisting with what's essentially homework.

That said, if memory consumption and set-up time are no object, a state
graph _could_ be better (I think some people reading this may be starting
to think, "what's up with that guy and his state graphs?"...I know, they
aren't really the end-all be-all, but I like them anyway :) ).

With a hash table, you are guaranteed to have to scan the string at least
once, because even if there's only one string associated with a specific
hash value, you need to check to make sure it's the one you're actually
looking for (I'm assuming here that you are not guaranteed ahead of time
that the string being looked up is actually in your dictionary).

Using a state graph, the worst-case scenario is having to scan the entire
string, and if you fail to match, or you match a string that is unique
earlier than the last character (a common enough scenario), then searching
for a string requires only a single partial scan of the string.

Why did I write "could"? Well, in reality it depends on how the state
graph is implemented.

For each node in the state graph, you need a way of mapping a new
character to the next node. If you're guaranteed only regular ASCII
characters, then this requires only a 128-entry lookup table (less,
actually, if you exclude control characters and maybe others too). This
requirement could come out of the basic dictionary contents, or you could
simplify the dictionary by removing accents, etc. and only supporting
languages using the Roman alphabet. But if you want to be more flexible,
then you're looking at some other data structure for the node lookup, and
that could start adding in cost again (for example, one easy
implementation might use a dictionary, which is essentially another hash
table :) ).
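
To make that concrete, here is a minimal sketch of such a state graph
(essentially a trie), assuming 7-bit ASCII input so that each node can use
a flat 128-entry child table; the class and member names are mine, not
anything standard:

// A trie node with a 128-entry lookup table, as described above.
// Assumes every character is 7-bit ASCII (c < 128).
class TrieNode
{
    public readonly TrieNode[] Next = new TrieNode[128];
    public bool IsWord;
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    public void Insert(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            if (node.Next[c] == null)
                node.Next[c] = new TrieNode();
            node = node.Next[c];
        }
        node.IsWord = true;
    }

    // Lookup bails out at the first character with no outgoing edge,
    // so a failed match often needs only a partial scan of the string.
    public bool Contains(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            node = node.Next[c];
            if (node == null)
                return false;
        }
        return node.IsWord;
    }
}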

If you can afford to burn 256K per state graph node, then the next-node
lookup is a single constant-order array retrieval for Unicode. But
otherwise, you need something that's going to be slower.

In addition, traversal of the state graph could wind up visiting memory
locations that are not proximal to each other, causing less efficient use
of the CPU cache than might occur with a regular hash table.

Of course, the hash table might not be able to be implemented in a
perfectly optimal way either. You could have a lot of collisions, either
because the hash function just isn't capable of reducing collisions, or
because you don't really have 2 million+ entries in the hash table.

Each implementation has its pros and cons and it's not really possible to
say in advance which would come out ahead.

Finally, these are sort of the standard fare for data structures classes.
There are a number of other possible data structures that are much more
advanced (i.e. complicated) and they may perform better in specific
scenarios (they would take advantage of constraints you are able to apply
to the problem, assuming there are any). People have been writing word
dictionaries for decades, and there are countless implementations, each
with varying performance and appropriateness for a given data set.

Your biggest problem with the approach you've described is that I just
don't know of a practical way to come up with a hash function that
guarantees no collisions. That's the only way you can do what you're
asking. I haven't looked closely at this myself, but I do believe it's
mathematically _possible_ to write a hash function, generated based on the
known input data, that will guarantee no collisions, but I suspect that
hash function is not practical to implement (it'd probably wind up being
some
incredibly-large-order polynomial or something, with magic constants
derived from the input data).

Pete
 