The domain of interest is string matching. Assume I have a structure like this.
typedef struct
{
char *name,
int (*function)();
} StringArray
StringArray s[] =
{
{"George", func1},
{"Paul", func2},
{"Ringo", func3},
{"John", func4},
{"", NULL} /* End of list */
}
There are a fixed number of strings in the array. They are hard coded as in the example. If the table changes, there would be a need to re-evaluate the quality of the hash function.
I want to apply a hash function to a string, and if the string matches one in the array, then call the function. A perfect hash function is needed for this. No collisions are allowed.The purpose of requiring hashing is to get O(1) performance on the lookup.
What ideas do you have on designing a function to do this?
Well, there is no perfect hash function.
You have several that minimize collisions, but none eliminates them.
Can't advise one though :P
EDIT: The solution can't be finding a perfect hash function. The solution is to be aware of collisions. Generically a hash function has collisions. That obviously depends on the dataset and the size of the resulting hash code.
The summary lists both C and C++. Which of them are you looking for? C and C++ are two distinct languages, and differ greatly in their string handling and data structures (and the fact that the C ones work in C++ doesn't change that).
Why, specifically, do you want a perfect hash function? Is it that you want to associate a string with a function, and thought that would be a good way to do it? Is this some sort of homework assignment? Do you have a reason not to use map<> in C++? (Or unordered_map<> if available?)
If you do need a perfect hash, what are the constraints on the strings? Will there be a certain fixed set you want to dispatch on? What about strings that don't match one of the set? Are you willing to accept hits from random strings, or is the number of incoming strings limited?
If you could edit your question to include information like that, we could be a lot more helpful.
EDIT (in response to the first two comments):
OK, we should look at C solutions, since you presumably want this for both your C and C++ work. You presumably want the performance, but have you tested? If we're dealing with strings coming in on the I/O system, the time there is likely to dwarf the dispatch time.
You are expecting arbitrary strings. It's a bit much to expect a perfect hash function that will avoid all collisions from random data, so you do need to consider that.
Have you considered a trie? It may be more efficient than a perfect hash function (or may not be), it should be fairly easy to implement in C, and it will avoid problems with redoing your list of dispatching strings or possible collisions.
The final result of this exercise was to
For the set of arrays I have in my domain, this seems to work very very well. A possible future optimization would be to do the same sort of testing, on substrings of the input. In the sample case, the first letter of each musicians name is sufficient to tell them apart. Then one would need to balance the cost of the actual hash function against the memory used.
My thanks to everyone who contributed ideas.
Evil
If collisions are absolutely not allowed, your only option is to keep track of every string in the database, which is probably not a best way to go.
What I would do is apply one of existing common strong hashing algorithms, such as: MD5 or SHA. There's miriads of samples all around, here's one for example: http://www.codeproject.com/KB/security/cryptest.aspx
You can use map
Use a balanced binary tree. Then you KNOW behaviour is ALWAYS O(logn).
I strongly dislike hashes. People don't realise how much risk they take with their algorithm. They run some test data and then deploy in the field. I have NEVER seen a deployed hash algorithm get checked for behaviour in the field.
O(log n) is almost always acceptable in lieu of O(1).