I am trying to do this for a data structures module for my degree; writing our own data structures is important for top marks, and more marks are awarded for good performance. Of the data structures I have studied so far, this one seems the best fit for the task. What would you recommend as a better alternative?
Caveat: I don't know the specific grading standard, how "performance" is
defined (does memory usage matter?), and frankly I don't care. I already
have mixed feelings about assisting with what's essentially homework.
That said, if memory consumption and set-up time are no object, a state
graph _could_ be better. (Some people reading this may be starting to
think, "what's up with that guy and his state graphs?"... I know, they
aren't really the be-all and end-all, but I like them anyway.)
With a hash table, you are guaranteed to have to scan the string at least
once: even if there's only one string associated with a specific hash
value, you still need to compare against it to make sure it's the one
you're actually looking for (I'm assuming here that you are not guaranteed
ahead of time that the string being looked up is actually in your
dictionary).
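To make that concrete, here's a minimal sketch in Python (the toy hash function and class names are my own, purely for illustration) of why a hash lookup ends up scanning the string anyway:

    def toy_hash(s, bucket_count):
        # Hypothetical hash: sum of character codes, modulo the bucket count.
        # Computing it is itself one full scan of the string.
        return sum(ord(c) for c in s) % bucket_count

    class ToyHashSet:
        def __init__(self, words, bucket_count=1024):
            self.buckets = [[] for _ in range(bucket_count)]
            for w in words:
                self.buckets[toy_hash(w, bucket_count)].append(w)

        def contains(self, s):
            bucket = self.buckets[toy_hash(s, len(self.buckets))]
            # Even a one-element bucket forces a full string comparison,
            # because two different strings can share a hash value.
            return any(candidate == s for candidate in bucket)

Even on a hit with no collisions, contains pays for one scan to compute the hash and one more to compare.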
With a state graph, the worst case is a single scan of the entire string;
if you fail to match, or you match a string whose prefix is unique before
the last character (a common enough scenario), then the search requires
only a partial scan of the string.
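Here's the state-graph walk as a comparable sketch (again my own construction; the shortcut for a prefix that's unique before the last character would additionally need a compressed, radix-style node, which I've left out):

    class Node:
        def __init__(self):
            self.children = {}      # character -> next Node
            self.terminal = False   # a dictionary word ends here

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.terminal = True

    def contains(root, word):
        node = root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False   # failed match: only a partial scan was needed
        return node.terminal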
Why did I write "could"? Because, in reality, it depends on how the state
graph is implemented.
For each node in the state graph, you need a way of mapping a new
character to the next node. If you're guaranteed only regular ASCII
characters, then this requires only a 128-entry lookup table (less,
actually, if you exclude control characters and maybe others too). This
requirement could come out of the basic dictionary contents, or you could
simplify the dictionary by removing accents, etc. and only supporting
languages using the Roman alphabet. But if you want to be more flexible,
then you're looking at some other data structure for the node lookup, and
that could start adding in cost again (for example, one easy
implementation might use a dictionary, which is essentially another hash
table).
If you can afford to burn 256K per state-graph node (a 65,536-entry table
of 4-byte links, i.e. one slot per 16-bit Unicode code unit), then the
next-node lookup is a single constant-order array retrieval even beyond
ASCII. But otherwise, you need something that's going to be slower.
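As a sketch of that trade-off (the sizes are mine: 128 entries for plain ASCII, 65,536 for a 16-bit code unit), compare an array-backed node against a dictionary-backed one:

    class ArrayNode:
        # Dense table indexed directly by character code: constant-order
        # lookup, but every node pays the full table's memory up front.
        def __init__(self, alphabet_size=128):   # 65_536 for 16-bit units
            self.children = [None] * alphabet_size
            self.terminal = False

        def step(self, ch):
            code = ord(ch)
            # Assumes inputs fall within the chosen alphabet.
            return self.children[code] if code < len(self.children) else None

    class DictNode:
        # Sparse mapping: small when a node has few outgoing edges, but
        # every step is itself a hash-table lookup.
        def __init__(self):
            self.children = {}
            self.terminal = False

        def step(self, ch):
            return self.children.get(ch)

(The 256K figure assumes 65,536 four-byte links; a Python list of object references is larger than that, but the shape of the trade-off is the same.)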
In addition, traversal of the state graph could wind up visiting memory
locations that are not proximal to each other, causing less efficient use
of the CPU cache than might occur with a regular hash table.
Of course, the hash table might not be implementable in a perfectly
optimal way either. You could have a lot of collisions, either because the
hash function just isn't good at spreading your keys, or because the table
doesn't really have room for 2 million+ entries, so multiple strings are
forced to share buckets.
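To illustrate the second point (the numbers here are mine): once there are more strings than buckets, the pigeonhole principle forces collisions no matter how good the hash is:

    from collections import Counter

    words = [f"word{i}" for i in range(10_000)]
    bucket_count = 4_096   # fewer buckets than words: collisions are forced

    load = Counter(hash(w) % bucket_count for w in words)
    print(sum(n - 1 for n in load.values()), "collisions;",
          "worst bucket holds", max(load.values()), "entries")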
Each implementation has its pros and cons and it's not really possible to
say in advance which would come out ahead.
Finally, these are sort of the standard fare for data structures classes.
There are a number of other possible data structures that are much more
advanced (i.e. complicated) and they may perform better in specific
scenarios (they would take advantage of constraints you are able to apply
to the problem, assuming there are any). People have been writing word
dictionaries for decades, and there are countless implementations, each
with varying performance and appropriateness for a given data set.
Your biggest problem with the approach you've described is that I just
don't know of a practical way to come up with a hash function that
guarantees no collisions, and that's the only way you can do what you're
asking. I haven't looked closely at this myself, but I do believe it's
mathematically _possible_ to write such a function (a perfect hash
function, generated from the known input data), but I suspect it would not
be practical to implement: it'd probably wind up being some
incredibly-large-order polynomial or something, with magic constants
derived from the input data.
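For what it's worth, here's a brute-force sketch of that idea (my own construction, not a standard algorithm): given the known input data, search for a seed that happens to make a simple hash collision-free over exactly that set. It finds a seed quickly for toy sets; whether anything like it stays practical at millions of entries is exactly the open question above.

    import zlib

    def find_collision_free_seed(words, table_size, max_tries=100_000):
        # Try seeds until zlib.crc32(seed || word) % table_size is
        # injective over this particular word set. May return None.
        for seed in range(max_tries):
            prefix = seed.to_bytes(4, "little")
            slots = {zlib.crc32(prefix + w.encode()) % table_size
                     for w in words}
            if len(slots) == len(words):   # no two words share a slot
                return seed
        return None

    words = ["apple", "banana", "cherry", "date", "elderberry"]
    print("collision-free seed:", find_collision_free_seed(words, 8))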
Pete