I want to create a Hash Map (or another structure, if you have any suggestions) to store key value pairs. The keys will all be inserted at once at the same time as the map is created, but I don't know what the keys will be (arbitrary length strings) until runtime, when I need to create the map.
I am parsing a query string like this: "x=100&name=bob&color=red&y=150" (but the string can have an unlimited number of variables, and the variables can have names of any length).
I want to parse it once and create a Hash Map, preferably minimal and with a perfect hash function to satisfy linear storage requirements. Once the map is created the values won't be modified or deleted, and no more key value pairs will be added to the map either, so the entire map is effectively a constant. I'm assuming that a variable doesn't occur twice in the string (i.e. "x=1&x=2" is not valid).
I am coding in C, and currently have a function that I can use like get("x") which will return the string "100", but it parses the query string each time, which takes O(n) time. I'd like to parse it once when it is first loaded, since it is a very large query string and every value will be read several times. Even though I'm using C, I don't need code in C as an answer. Pseudocode, or any suggestions at all, would be awesome!
There are some very good hashing routines; however, proving one of them to be near-perfect requires a lot of knowledge of the inputs. It seems that your inputs are unconstrained enough to make such a proof near-impossible.
Generally speaking, a perfect (or near-perfect) routine is sensitive to each bit/byte of input. For speed, the combination operation is typically XOR. Such routines prevent two identical bytes from cancelling each other out by shifting or rotating the bits. However, the shift amount should be relatively prime to the maximum number that can be represented; otherwise, patterns in the input could be partially cancelled by previous input. This reduces entropy in the solution, increasing the chance of collision.
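To make the rotate-and-XOR idea concrete, here is a minimal sketch in C. The function names, the 32-bit word and the rotation amount of 5 bits are all assumptions chosen for illustration (5 shares no factor with the 32-bit width, which is one reading of the "relatively prime" advice above); it is not a vetted hash function.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by r bits (0 < r < 32). */
static uint32_t rotl32(uint32_t v, unsigned r)
{
    return (v << r) | (v >> (32u - r));
}

/* Plain XOR of all bytes: identical bytes cancel and order is ignored. */
uint32_t plain_xor_hash(const char *s)
{
    uint32_t h = 0;
    for (; *s != '\0'; s++)
        h ^= (unsigned char)*s;
    return h;
}

/* Rotate the running value before each XOR so that repeated or
 * reordered bytes no longer cancel. The rotation amount (5) is an
 * assumed value for this sketch. */
uint32_t rot_xor_hash(const char *s)
{
    uint32_t h = 0;
    for (; *s != '\0'; s++) {
        h = rotl32(h, 5);
        h ^= (unsigned char)*s;
    }
    return h;
}
```

With plain XOR, "aa" hashes to 0 and "ab" collides with "ba"; with the rotation in place, both of those cases produce distinct, non-zero values.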
The typical solution is to
The problems with such a routine are known. Basically there is a lack of variation in the input, and this makes dispersing the input non-ideal. That said, this technique gives a good dispersion of input bits across the entire domain of outputs provided there is sufficient input to wander away from the initial prime starting number. Unfortunately, picking a random starting number is not a solution, as then it becomes impossible to accurately recompute the hash.
In any case, the prime used in the multiplication should not overflow the multiplication. Likewise, the high-order bits that are captured must be folded back into the low-order bits if you want to avoid losing the dispersion effects of the initial input (and the result becoming grouped around the latter bits / bytes only). Prime number selection affects the dispersion, and sometimes tuning is required for good effect.
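One way to read the two requirements above (no overflow in the multiplication, high-order bits folded back into the low-order bits) is sketched below. The starting value 17, the multiplier 31 and the name prime_fold_hash are assumptions for illustration, not values taken from this answer.

```c
#include <stdint.h>

#define HASH_SEED  17u   /* assumed fixed prime starting value */
#define HASH_PRIME 31u   /* assumed prime multiplier */

/* Multiply-by-prime string hash. The product is computed in 64 bits
 * so the multiplication itself cannot overflow, and whatever lands
 * above bit 31 is XORed back into the low word instead of being
 * discarded. */
uint32_t prime_fold_hash(const char *s)
{
    uint32_t h = HASH_SEED;
    for (; *s != '\0'; s++) {
        uint64_t wide = (uint64_t)h * HASH_PRIME + (unsigned char)*s;
        h = (uint32_t)wide ^ (uint32_t)(wide >> 32);
    }
    return h;
}
```

Because the starting value is a fixed constant rather than a random number, the same key always produces the same hash, which is the property the previous paragraph insists on.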
By now you should easily be able to see that a near-perfect hash takes more computational time than a decent less-than-near-perfect hash. Hash algorithms are designed to account for collision, and most Java hash structures resize at occupancy thresholds (typically in the 70% range, but it is tunable). Since the resizing is built in, as long as you don't write a terrible hash, the Java data structures will continue to retune you into having less of a chance of collision.
Optimizations which can speed up a hash include computing on groups of bits, dropping the occasional byte, pre-computing lookup tables of commonly used multiplied numbers (indexed by input), etc. Don't assume that an optimization is faster. Depending on the architecture, machine details, and "age" of the optimization, its assumptions sometimes no longer hold, and applying it actually increases the time to compute the hash.
If you know the set of all possible variable names, then it would be possible to perfect hash the names to numbers, but each of the hash tables would end up having the same length. For example, if x and y are the names, then the map would always be of length 2. If perfect(str) turns 'x' and 'y' into 0 and 1, then the function get would be a direct lookup at index perfect(str).
There's no such thing as a perfect hash in what you're describing. A perfect hash would be the original input. If you're guaranteed that your data will only be certain things (such as Latin-based ASCII or only certain keys) then you can hash well, but perfect? No. Not possible. You have to create a linked-list or vector hash-miss mechanism as well. Any variation in the system (like the count of inputs in your case) will invalidate the perfect hash concept.
What you want defies the laws of math.
You can achieve near O(1), but there are unanswered questions here. The questions are:
Although a perfect hash isn't possible, it becomes entirely academic if you can simply have a simple linked list with a bucket size that is at least two standard deviations out from the mean of your potential unique hashes. It's minimal memory (relatively speaking of course and depending on total potential size), deletion friendly, and would be nearly O(1) lookup time as long as question 3 is answered something like, "far smaller".
The following should get you started but I'll leave decisions about which hash algorithm to use up to you...
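As a rough illustration of the linked-list bucket approach described above, here is a minimal separately chained map in C. It is a sketch under stated assumptions, not this answer's own listing: the names (map_t, map_create, map_put, map_get, map_free), the string value type and the placeholder djb2-style hash are all hypothetical, and map_put assumes keys are unique, as the question guarantees.

```c
#include <stdlib.h>
#include <string.h>

typedef struct entry {
    char *key;
    char *value;
    struct entry *next;          /* next entry in the same bucket */
} entry_t;

typedef struct {
    entry_t **buckets;
    size_t nbuckets;
} map_t;

/* Placeholder hash (djb2-style); swap in whatever suits your keys. */
static size_t hash_str(const char *s)
{
    size_t h = 5381;
    for (; *s != '\0'; s++)
        h = h * 33 + (unsigned char)*s;
    return h;
}

static char *dup_str(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p != NULL)
        memcpy(p, s, n);
    return p;
}

/* nbuckets must be greater than zero. */
map_t *map_create(size_t nbuckets)
{
    map_t *m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;
    m->buckets = calloc(nbuckets, sizeof *m->buckets);
    if (m->buckets == NULL) {
        free(m);
        return NULL;
    }
    m->nbuckets = nbuckets;
    return m;
}

/* Returns 0 on success, -1 on allocation failure. */
int map_put(map_t *m, const char *key, const char *value)
{
    size_t i = hash_str(key) % m->nbuckets;
    entry_t *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->key = dup_str(key);
    e->value = dup_str(value);
    if (e->key == NULL || e->value == NULL) {
        free(e->key);
        free(e->value);
        free(e);
        return -1;
    }
    e->next = m->buckets[i];     /* push onto the front of the chain */
    m->buckets[i] = e;
    return 0;
}

/* Returns the stored value, or NULL if the key is absent. */
const char *map_get(const map_t *m, const char *key)
{
    const entry_t *e = m->buckets[hash_str(key) % m->nbuckets];
    for (; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e->value;
    return NULL;
}

void map_free(map_t *m)
{
    for (size_t i = 0; i < m->nbuckets; i++) {
        entry_t *e = m->buckets[i];
        while (e != NULL) {
            entry_t *next = e->next;
            free(e->key);
            free(e->value);
            free(e);
            e = next;
        }
    }
    free(m->buckets);
    free(m);
}
```

For the query string in the question, you would call map_put once per key/value pair while parsing, after which each get("x") becomes a map_get call that walks a single, usually very short, chain.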
Usage examples (as assertions) and efficiency tests. Using int as the data value type...
Additionally I did some tests using 100,000 randomly generated ASCII keys with lengths between 5 and 1000 characters that showed the following...
As you can see, it has the potential to perform quite well. An efficiency of 80% means that approximately 80% of the lookups are O(1), about 16% of the lookups are O(2), about 3.2% of the lookups are O(3), and about 0.8% of lookups are O(4+). This means that on average a lookup would take O(1.248).
Likewise, an efficiency of 50% means that 50% of lookups are O(1), 25% are O(2), 12.5% are O(3), and 12.5% are O(4+).
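The averages follow from weighting each probe count by its share of lookups. The worked arithmetic below counts the "4+" group as exactly 4 probes (an assumption, which makes both figures slight underestimates):

```latex
\begin{aligned}
E_{80\%} &\approx 0.8(1) + 0.16(2) + 0.032(3) + 0.008(4) = 1.248 \\
E_{50\%} &\approx 0.5(1) + 0.25(2) + 0.125(3) + 0.125(4) = 1.875
\end{aligned}
```

If the shares were to keep shrinking geometrically beyond the fourth probe, the averages would approach 1/0.8 = 1.25 and 1/0.5 = 2 probes respectively.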
You really just need to pick (or write) the right hashing algorithm for your known factors and tweak things for your specific needs.
Notes:
move(), swap(), sort(), insert(), etc. by managing entry->prev and entry->next
Try GPL'd gperf, or Bob Jenkins' public domain implementation in C
Procedure:
receive query string and identify domain of perfect hash function by enumerating the list of keys
provide these keys and list size (the range will be 1..size) to the perfect hash generation function derived from above reference implementations
Use the perfect hash function generated to create the HashMap
Use the same perfect hash function to process the get requests in the HashMap
Edit: Necrolis noted in the comment below that the reference implementations output perfect hash functions in C source code, so you'll need to modify them to generate something like a bytecode for a VM instead. You could also use an interpreted language like embedded Scheme or Lua.
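For small key sets, the procedure above can be approximated at runtime without generating any source code: keep trying seeds for one hash function until every key lands in its own slot. The sketch below is a brute-force stand-in and is not how gperf or Bob Jenkins' generator work. The names find_perfect_seed and perfect_index, the MAX_KEYS limit and the mixing constants (borrowed from FNV-1a and the MurmurHash3 finalizer) are assumptions, and the search slows down very quickly as the number of keys grows.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_KEYS 16   /* brute force is only realistic for small sets */

/* Seeded string hash: an FNV-1a-style loop followed by a final mix so
 * that the seed influences the low bits used for the slot index. */
static uint32_t seeded_hash(const char *s, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* Slot of a key in a table of n slots, for a given seed. */
size_t perfect_index(const char *key, uint32_t seed, size_t n)
{
    return seeded_hash(key, seed) % n;
}

/* Try seeds 0..max_tries-1 until all n keys map to distinct slots in
 * 0..n-1. Returns 1 and stores the seed on success, 0 otherwise. */
int find_perfect_seed(const char *const keys[], size_t n,
                      uint32_t max_tries, uint32_t *seed_out)
{
    unsigned char used[MAX_KEYS];

    if (n == 0 || n > MAX_KEYS)
        return 0;
    for (uint32_t seed = 0; seed < max_tries; seed++) {
        size_t i;
        memset(used, 0, n);
        for (i = 0; i < n; i++) {
            size_t slot = perfect_index(keys[i], seed, n);
            if (used[slot])
                break;            /* collision: try the next seed */
            used[slot] = 1;
        }
        if (i == n) {
            *seed_out = seed;
            return 1;
        }
    }
    return 0;
}
```

Once a seed is found, the values can be stored in a plain array of n elements at perfect_index(key, seed, n), and get uses the same call to find the slot. A lookup for a key that was never inserted still lands on some slot, so get should compare the stored key before returning the value.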
It would be interesting to know if this is worth the effort over a simple (non-perfect) HashMap when the overhead of creating the perfect hash function is amortized over the lookups
Another option is Cuckoo hashing which also has O(1) lookups
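A minimal sketch of cuckoo hashing in C follows, to show where the O(1) lookup comes from: every key has exactly two candidate slots, one in each table, so a lookup inspects at most two places. The names, the table size, the kick limit and the two seeds are assumptions for illustration. The table stores pointers rather than copies, so keys and values must outlive it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CUCKOO_SLOTS     64   /* slots per table; assumes far fewer keys */
#define CUCKOO_MAX_KICKS 32   /* give up after this many displacements */

typedef struct {
    const char *key;          /* NULL marks an empty slot */
    const char *value;
} slot_t;

typedef struct {
    slot_t t[2][CUCKOO_SLOTS];
} cuckoo_t;

static uint32_t hash_from(const char *s, uint32_t h)
{
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h ^ (h >> 15);     /* mix high bits into the low bits */
}

/* Candidate slot of a key in table 0 or table 1. */
static size_t slot_of(const char *key, int which)
{
    static const uint32_t seeds[2] = { 2166136261u, 0x9e3779b9u };
    return hash_from(key, seeds[which]) % CUCKOO_SLOTS;
}

void cuckoo_init(cuckoo_t *c)
{
    for (int w = 0; w < 2; w++)
        for (size_t i = 0; i < CUCKOO_SLOTS; i++)
            c->t[w][i].key = c->t[w][i].value = NULL;
}

/* Returns 0 on success. Returns -1 if the kick limit is reached; the
 * pair being carried at that point is left out of the table, and a
 * fuller implementation would rebuild with new seeds. */
int cuckoo_put(cuckoo_t *c, const char *key, const char *value)
{
    slot_t cur = { key, value };
    int which = 0;

    for (int kick = 0; kick < CUCKOO_MAX_KICKS; kick++) {
        slot_t *s = &c->t[which][slot_of(cur.key, which)];
        if (s->key == NULL) {
            *s = cur;
            return 0;
        }
        slot_t evicted = *s;  /* take the slot, move the old occupant on */
        *s = cur;
        cur = evicted;
        which = !which;       /* the evicted key tries its other table */
    }
    return -1;
}

/* Checks at most two slots; returns NULL if the key is absent. */
const char *cuckoo_get(const cuckoo_t *c, const char *key)
{
    for (int which = 0; which < 2; which++) {
        const slot_t *s = &c->t[which][slot_of(key, which)];
        if (s->key != NULL && strcmp(s->key, key) == 0)
            return s->value;
    }
    return NULL;
}
```

Since the map in the question is never modified after construction, the extra work cuckoo hashing does on insertion is paid once, during the single parse of the query string.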