Question:
Supposing simple uniform hashing, that is, any given value is equally likely to hash into any of the slots of the hash table: why is it better to use a table of size 127 and not 128? I really don't understand what the problem with powers of 2 is, or how it actually makes any difference at all.
"When using the division method, we usually avoid certain values of m (table size). For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k."
Let's suppose the possible elements are only between 1 and 10000 and I picked the table size as 128. How can 127 be better?
So 128 is 2^7 (1000 0000 in binary) and 127 is 0111 1111. What difference does this make? All numbers (when hashed) are still going to be the p lowest-order bits of k for 127 too. Did I get something wrong?
I'm looking for some examples, as I really can't understand why this is bad. Thanks a lot in advance!
PS: I am aware of:
Hash table: why size should be prime?
Answer 1:
"All numbers (when hashed) are still going to be the p lowest-order bits of k for 127 too."
That is wrong (or I misunderstood). k % 127 depends on all bits of k. k % 128 only depends on the 7 lowest bits.
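To see this concretely, here is a quick Python check (my own sketch, not part of the original answer): flipping a bit above position 7 changes k % 127 but leaves k % 128 untouched.

    k = 91
    k_flipped = k ^ (1 << 10)   # flip bit 10: same low 7 bits, k becomes 1115

    print(k % 128, k_flipped % 128)   # 91 91  -> mod 128 never sees the high bit
    print(k % 127, k_flipped % 127)   # 91 99  -> mod 127 does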
EDIT:
If you have a perfect distribution between 1 and 10,000, then both k % 127 and k % 128 will turn it into an excellent smaller distribution: every bucket will contain about 10,000 / 128 = 78 (or 79) items.
If you have a distribution between 1 and 10,000 that is biased because {x, 2x, 3x, ...} occur more often, then a prime size will give a much, much better distribution, as explained in this answer (unless x is exactly that prime size).
Thus, cutting off the high bits (using a size of 128) is no problem whatsoever if the distribution in the lower bits is good enough. But with real data and really badly designed hash functions, you will need those high bits.
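To make the biased case concrete, here is a small Python sketch (my own illustration; the stride x = 8 is an arbitrary choice):

    from collections import Counter

    # Biased input: every key between 1 and 10,000 is a multiple of x = 8.
    keys = range(8, 10001, 8)

    for m in (127, 128):
        buckets = Counter(k % m for k in keys)
        print(f"m={m}: {len(buckets)} of {m} buckets used, "
              f"largest bucket holds {max(buckets.values())}")
    # m=127: 127 of 127 buckets used, largest bucket holds about 10
    # m=128:  16 of 128 buckets used, largest bucket holds about 79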
Answer 2:
Division Method
"When using the division method, we usually avoid certain values of m
(table size). For example, m should not be a power of 2
, since if m =
2p
, then h(k)
is just the p
lowest-order bits of k
."
--CLRS
To understand why m = 2^p uses only the p lowest bits of k, you must first understand the modulo hash function h(k) = k % m.
The key can be written in terms of a quotient n, a divisor q, and a remainder r:
k = nq + r
Choosing the divisor to be q = m allows us to write k % m simply as the remainder in the above equation:
k % m = r = k - nm, where r < m
Therefore, k % m is equivalent to repeatedly subtracting m a total of n times (until r < m):
k % m = k - m - m - ... - m, until the result is less than m
Let's try hashing the key k = 91 with m = 2^4 = 16.
91 = 0101 1011
- 16 = 0001 0000
----------------
75 = 0100 1011
- 16 = 0001 0000
----------------
59 = 0011 1011
- 16 = 0001 0000
----------------
43 = 0010 1011
- 16 = 0001 0000
----------------
27 = 0001 1011
- 16 = 0001 0000
----------------
11 = 0000 1011
Thus, 91 % 2^4 = 11 is just the binary form of 91 with only the p = 4 lowest bits remaining.
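The walkthrough above can be checked in a few lines of Python (my own sketch): repeated subtraction, the % operator, and bit masking all agree.

    def mod_by_subtraction(k, m):
        # Keep subtracting m until the remainder is smaller than m.
        while k >= m:
            k -= m
        return k

    k, p = 91, 4
    m = 1 << p   # m = 2^p = 16
    assert mod_by_subtraction(k, m) == k % m == (k & (m - 1)) == 11
    print(bin(k), "->", bin(k & (m - 1)))   # 0b1011011 -> 0b1011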
Important Distinction:
This pertains specifically to the division method of hashing. In fact, the opposite holds for the multiplication method, as stated in CLRS:
"An advantage of the multiplication method is that the value of m is not critical... We typically choose [m] to be a power of 2 since we can then easily implement the function on most computers."
Answer 3:
Nick is right that, in general, the hash table size doesn't matter. However, in the special case where open addressing with double hashing is used (in which the interval between probes is computed by another hash function), a prime-number-sized hash table is best, because it ensures that all hash table slots remain reachable when probing for a position for a new element (as Corkscreewe mentioned).
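Here is a minimal Python sketch of that failure mode (my own illustration; the start slot of 0 and the even step of 32 are arbitrary assumptions standing in for the two hash values):

    def probe_slots(start, step, m):
        # Slots visited by double hashing: (start + i * step) % m.
        seen = set()
        slot = start
        while slot not in seen:
            seen.add(slot)
            slot = (slot + step) % m
        return seen

    print(len(probe_slots(0, 32, 128)))   # 4   -> only 4 of 128 slots reachable
    print(len(probe_slots(0, 32, 127)))   # 127 -> every slot reachable

With a prime table size, any nonzero step is coprime to m, so the probe sequence covers the whole table.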
Answer 4:
First off, it's not just about picking a prime number. For your example, if you know your data set will be in the range 1 to 10,000, picking 127 or 128 won't make a difference, because either way it's a poor design choice: the table is far too small.
Rather, it's better to pick a really large prime, like 3967 for your example, so that collisions are minimized. With 127 or 128 buckets, every bucket will be filled with roughly 78 keys each, which degrades the O(1) insertion and lookup times toward O(n); with 3967 buckets the chains stay short and the O(1) run times are preserved.
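To put rough numbers on that point, here is a minimal Python sketch (my own illustration; the bucket counts come from the example above):

    # Load factor: 10,000 keys spread over m buckets (separate chaining assumed).
    for m in (127, 128, 3967):
        print(f"m={m}: average chain length ~ {10000 / m:.1f}")
    # m=127: ~78.7, m=128: ~78.1  -> every lookup walks a long chain
    # m=3967: ~2.5                -> lookups stay close to O(1)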
EDIT #4
The design of the "hash function" is somewhat of a black art. It can be highly influenced by the data that's intended to be stored in the hashing-based data structure, so the discussion of a sensible hashing function can often stray into a discussion about specific inputs.
As for why primes are "preferred", one has to consider an "adversary" analysis: suppose I designed a general hashing-based data structure; how would it perform given the worst input from an adversary? Since performance is dictated by hashing collisions, the question becomes which hash minimizes collisions in the worst condition. One such condition is when the inputs are always numbers divisible by some integer, say 4. If you use N = 128, then any number divisible by 4, taken mod 128, is still divisible by 4, which means only buckets 0, 4, 8, 12, ... are ever used, resulting in 25% utilization of the data structure. Primes effectively reduce the likelihood of such a scenario occurring, with numbers > N.
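The quoted scenario is easy to reproduce (a minimal Python sketch of my own):

    # Adversarial input: every key is divisible by 4.
    keys = range(4, 100001, 4)

    for n in (128, 127):
        used = len({k % n for k in keys})
        print(f"N={n}: {used}/{n} buckets used ({100 * used // n}% utilization)")
    # N=128: 32/128 buckets used (25% utilization)
    # N=127: 127/127 buckets used (100% utilization)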
Answer 5:
If you have a perfect hash function that has an even distribution, then it doesn't matter.
Answer 6:
Wikipedia actually has a good summary of this:
http://en.wikipedia.org/wiki/Hash_table
They point out that some hash functions are designed to operate ONLY with prime numbers. This article explains why powers of two are bad:
http://www.concentric.net/~Ttwang/tech/primehash.htm
Answer 7:
I cannot prove it anymore, although I remember having to do so in an exam at university a million years ago, but optimal hash sizes are not merely prime. You want to pick a prime number N such that N = 4*M - 1 (where M is also an integer).
That makes 31 a better number of buckets than 29. M is 8 when N is 31, but there is no integral M when N is 29.
As I said, I no longer remember the math to prove this. It was in a theory course taught by Rachel Manber, Udi’s wife, about 25 years ago or so.
Answer 8:
Here is a way to understand "k % 127 depends on all bits of k; k % 128 only depends on the 7 lowest bits."
k % 128 equals k & (2^7 - 1). For example: 129 % 128 = 1; in binary, 1000 0001 & 0111 1111 = 0000 0001. Every bit above the lowest 7 is ANDed with a 0 of (2^7 - 1), so it does not matter what the high positions hold. But this shortcut is invalid for moduli that are not of the form 2^n.
Now let's look at how we do long division in decimal for 129 % 127: first take the highest digit, 1, which is less than 127; bring down the next digit to get 12, still less than 127; then bring down the 9 to get 129, which divided by 127 leaves a remainder of 2. In math: 129 = 1 * 127 + 2 (this procedure is called long division). Binary division works the same way, so every bit of k takes part: k % 127 depends on all bits of k.
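Both claims are easy to verify in Python (my own sketch):

    # The masking shortcut holds only for power-of-two moduli.
    for k in (129, 1115, 54321):
        assert k % 128 == k & 127       # always true, since 128 = 2^7
        print(k % 127, k & 126)         # generally differ: 127 is not a power of 2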
Answer 9:
I believe that it just has to do with the fact that computers work in base 2. Something similar happens with base 10.
...
Picking a big enough, non-power-of-two number will make sure the hash function really is a function of all the input bits, rather than a subset of them.
From Why hash tables should use a prime-number size.