Good hash function for permutations?

2019-03-09 09:09发布

问题:

I have got numbers in a specific range (usually from 0 to about 1000). An algorithm selects some numbers from this range (about 3 to 10 numbers). This selection is done quite often, and I need to check if a permutation of the chosen numbers has already been selected.

e.g one step selects [1, 10, 3, 18] and another one [10, 18, 3, 1] then the second selection can be discarded because it is a permutation.

I need to do this check very fast. Right now I put all arrays in a hashmap, and use a custom hash function: just sums up all the elements, so 1+10+3+18=32, and also 10+18+3+1=32. For equals I use a bitset to quickly check if elements are in both sets (I do not need sorting when using the bitset, but it only works when the range of numbers is known and not too big).

This works ok, but can generate lots of collisions, so the equals() method is called quite often. I was wondering if there is a faster way to check for permutations?

Are there any good hash functions for permutations?

UPDATE

I have done a little benchmark: generate all combinations of numbers in the range 0 to 6, and array length 1 to 9. There are 3003 possible permutations, and a good hash should generated close to this many different hashes (I use 32 bit numbers for the hash):

  • 41 different hashes for just adding (so there are lots of collisions)
  • 8 different hashes for XOR'ing values together
  • 286 different hashes for multiplying
  • 3003 different hashes for (R + 2e) and multiplying as abc has suggested (using 1779033703 for R)

So abc's hash can be calculated very fast and is a lot better than all the rest. Thanks!

PS: I do not want to sort the values when I do not have to, because this would get too slow.

回答1:

One potential candidate might be this. Fix a odd integer R. For each element e you want to hash compute the factor (R + 2*e). Then compute the product of all these factors. Finally divide the product by 2 to get the hash.

The factor 2 in (R + 2e) guarantees that all factors are odd, hence avoiding that the product will ever become 0. The division by 2 at the end is because the product will always be odd, hence the division just removes a constant bit.

E.g. I choose R = 1779033703. This is an arbitrary choice, doing some experiments should show if a given R is good or bad. Assume your values are [1, 10, 3, 18]. The product (computed using 32-bit ints) is

(R + 2) * (R + 20) * (R + 6) * (R + 36) = 3376724311

Hence the hash would be

3376724311/2 = 1688362155.



回答2:

Summing the elements is already one of the simplest things you could do. But I don't think it's a particularly good hash function w.r.t. pseudo randomness.

If you sort your arrays before storing them or computing hashes, every good hash function will do.

If it's about speed: Have you measured where the bottleneck is? If your hash function is giving you a lot of collisions and you have to spend most of the time comparing the arrays bit-by-bit the hash function is obviously not good at what it's supposed to do. Sorting + Better Hash might be the solution.



回答3:

If I understand your question correctly you want to test equality between sets where the items are not ordered. This is precisely what a Bloom filter will do for you. At the expense of a small number of false positives (in which case you'll need to make a call to a brute-force set comparison) you'll be able to compare such sets by checking whether their Bloom filter hash is equal.

The algebraic reason why this holds is that the OR operation is commutative. This holds for other semirings, too.



回答4:

depending if you have a lot of collisions (so the same hash but not a permutation), you might presort the arrays while hashing them. In that case you can do a more aggressive kind of hashing where you don't only add up the numbers but add some bitmagick to it as well to get quite different hashes.

This is only beneficial if you get loads of unwanted collisions because the hash you are doing now is too poor. If you hardly get any collisions, the method you are using seems fine



回答5:

I would suggest this: 1. Check if the lengths of permutations are the same (if not - they are not equal)

  1. Sort only 1 array. Instead of sorting another array iterate through the elements of the 1st array and search for the presence of each of them in the 2nd array (compare only while the elements in the 2nd array are smaller - do not iterate through the whole array).

note: if you can have the same numbers in your permutaions (e.g. [1,2,2,10]) then you will need to remove elements from the 2nd array when it matches a member from the 1st one.

pseudo-code:

if length(arr1) <> length(arr2) return false;
sort(arr2);
for i=1 to length(arr1) {
elem=arr1[i];
j=1;
while (j<=length(arr2) and elem<arr2[j]) j=j+1;
if elem <> arr2[j] return false;
}
return true;

the idea is that instead of sorting another array we can just try to match all of its elements in the sorted array.



回答6:

You can probably reduce the collisions a lot by using the product as well as the sum of the terms.

1*10*3*18=540 and 10*18*3*1=540

so the sum-product hash would be [32,540]

you still need to do something about collisions when they do happen though



回答7:

I like using string's default hash code (Java, C# not sure about other languages), it generates pretty unique hash codes. so if you first sort the array, and then generates a unique string using some delimiter.

so you can do the following (Java):

    int[] arr = selectRandomNumbers();
    Arrays.sort(arr);
    int hash = (arr[0] + "," + arr[1] + "," + arr[2] + "," + arr[3]).hashCode();

if performance is an issue, you can change the suggested inefficient string concatenation to use StringBuilder or String.format

   String.format("{0},{1},{2},{3}", arr[0],arr[1],arr[2],arr[3]);

String hash code of course doesn't guarantee that two distinct strings have different hash, but considering this suggested formatting, collisions should be extremely rare