hash function for src dest ip + port

2019-02-02 15:49发布

问题:

So, I am looking at different hash functions to use for hashing a 4 tuple ip and port to identify flows.

One I came across was

((size_t)(key.src.s_addr) * 59) ^
((size_t)(key.dst.s_addr)) ^
((size_t)(key.sport) << 16) ^
((size_t)(key.dport)) ^
((size_t)(key.proto));

Now for the life of me, I cannot explain the prime used (59). why not 31, and then why go mess it up by multiplying the sport by a power of 2. Is there a better hash function to be used for ip addresses ?

回答1:

The prime number is used because when one value is multiplied by a prime number it tends to have a higher probability of remaining unique when other similar operations are accumulated on top of it. The specific value 59 may have been choosen arbitrarily or it may be intentional. It is hard to tell. It is possible that 59 tended to generate a better distribution of values based on the most likely inputs.

The shift by 16 may be because ports are limited to the range 2^16. The function appears to be moving the source port into the higher part of the bitfield while leaving the destination port in the lower part. I think this can be explained further in my next paragraph.

Another reason why the multiplication takes place (and this is true of the shift operation as well) is because it breaks down the associative nature of the hash function. Remember, XOR is associative so the IPs src=192.168.1.1 and dst=192.168.1.254 would hash to the same value as src=192.168.1.254 and dst=192.168.1.1 (swapped) if the multiplication were not there.



回答2:

Examine the output of the function for uniform distribution. If you don't like it, plug in some different primes until you get a distribution you like. Hashing can be a very dark art with no 'right' answer.



回答3:

Personally I think you'd be better off reading the four IP bytes as an unsigned long which would give you a number roughly in the range 0 - 2^32-1. Then you figure out how many flows you want to have active at any one time and that would be your index table size.

Take 2000 for example. That means you want to map 2^32 numbers onto roughly 2^11 indeces (to flow information). That won't work because hashing almost never works if filled to 100% and even 90% can be difficult. Using a index table that you only fill to 50% (4000 indeces) or even 25% (8000) is no big deal with todays memories.

The exact size of the index table should be an uneven number of locations and preferably a prime number. This is because you'll most likely need to have some overflow handling to handle collisions (two or more ip numbers which after the hashing point to the same location in the index table) - which you'll get. The overflow handling should be another prime number less than the index table size. All these prime numbers! What's with them anyway?

I'll illustrate with an example (in C):

idxsiz = prime(2000 * 2);    // 50% loading
ovfjmp = prime(idxsiz/3);

...

initially fill the table of idxjmp positions with an UNUSED marking (-1). Have a DELETED marking ready (-2).

Your ip number enters the system and you look for its flow record (may or may not exist):

stoppos = ip % idxsiz;    /* modulo (division) just this once */
i = stoppos;
do
{
  if (index[i] == UNUSED) return NULL;
  if (index[i] != DELETED)
  {
    flowrecptr = &flow_record[index[i]];
    if (!flowrecptr->in_use) {/* hash table is broken */}
    if (flowrecptr->ip == ip) return flowrecptr;
  }
  i += ovfjmp;
  if (i >= idxsiz) i -= idxsiz;
}
while (i != stoppos);
return NULL; 

The UNUSED serves as a marker that this index has never been used and that searching should stop. The DELETED serves as a marker that this index has been used but no longer. That means that searching must continue.

This was when you were attempting to do a get. You got a NULL back from get so you need to do a put which you begin by finding the first index position containing UNUSED or DELETED. Replace this value with an index to the first/next free row on the flow_record table. Mark the row as in_use. Put the original ip number into the ip member of the flow_record row.

This is a very simple - but very effective - way to construct a hashing mechanism. Practically every optimization in the form of special functions to be used after this or that function has failed will enhance the effectiveness of the hashing.

The use of prime numbers will ensure that - in the worst case where all index locations are occupied - the mechanism will test every single location. To illustrate this: suppose idxsiz is evenly divisible by ovfjmp: you won't have much overflow handling to speak of. 35 and 7 will result in locations 0,7,14,21 and 28 being tested before the index jumps to 0 where the while test will cause the search to stop.

----------------------OOPS!

I missed that you wanted the port number as well. Assuming ip V4 that means 6 bytes of address. Read this as an unsigned 64-bit integer and clear the top 16 bits/2 bytes. Then you do the modulo calculation.



回答4:

Brian Gideon pretty much sums it up; the multiplication and the shift are intended as a symmetry breaker. So this catches the hypothetical case of machine A telnetting to machine B and vice versa and they happen to chose the same ephemeral portnum. Not very common, but not impossible. Much of the 5-tuple is pretty constant: protocol comes from a very small domain, and so does one half of {address,portnum}.

Assuming a prime-sized hashtable, the magic constant 59 could be replaced by any prime, IMO. The (port << 16) could also be replaced by another shift, as long as no bits fall off or even by a (port * some_other_prime) term.

For a power-of-two sized hashtable, all(minus one) members of the 5-tuple should be multiplied by a (different) prime. (in the old days, when division was expensive that would have been an option)