Hash function on list independant of order of item

2019-02-23 14:32发布

I want to have a dictionary that assigns a value to a set of integers.

For example key is [1 2 3] and value will have certain value.

The thing is that [3 2 1] needs to be treated the same in my case so hash needs to be equal, if I go with hash approach.

The set will have 2 to 10 items.

Sum of items is usually fixed so we cannot make hashcode according to sum, which is a first natural idea here.

Not a homework task, actually facing this problem in my code.

This set is basically IEnumerable<int> in C# so any data structure is fine to store them.

Any help appreciated. Performance is pretty important here too.

An immediate thought: we could sum up items^2 and already get some kind of better hash, but still I would like to hear some thoughts.

EDIT: hmm really sorry guys, everyone suggests ordering, didn't come to my mind that I needed to say that actually ordering and hashing is the current solution I use and I am considering faster alternatives.

9条回答
ら.Afraid
2楼-- · 2019-02-23 15:11

Use a HashSet<T> and HashSet<T>.CreateSetComparer(), which returns an IEqualityComparer<HashSet<T>>.

查看更多
唯我独甜
3楼-- · 2019-02-23 15:12

I think what is mentioned in this paper would definitely help:

http://people.csail.mit.edu/devadas/pubs/mhashes.pdf

Incremental Multiset Hash Functions and Their Application to Memory Integrity Checking

Abstract: We introduce a new cryptographic tool: multiset hash functions. Unlike standard hash functions which take strings as input, multiset hash functions operate on multisets (or sets). They map multisets of arbitrary finite size to strings (hashes) of fixed length. They are incremental in that, when new members are added to the multiset, the hash can be updated in time proportional to the change. The functions may be multiset-collision resistant in that it is difficult to find two multisets which produce the same hash, or just set-collision resistant in that it is difficult to find a set and a multiset which produce the same hash.

查看更多
家丑人穷心不美
4楼-- · 2019-02-23 15:18

Create your own type that implements IEnumerable<T>.

Override GetHashCode. In it, sort your collection, call and return ToArray().GetHashCode().

查看更多
Emotional °昔
5楼-- · 2019-02-23 15:19

If the range of the values in key happens to be limited to low-ish positive integers, you could map each one to a prime number using a simple lookup, then multiply them together to arrive at the value.

Using the example in the question:

[1, 2, 3] maps to 2 x 3 x 5 = 30
[3, 2, 1] maps to 5 x 3 x 2 = 30
查看更多
相关推荐>>
6楼-- · 2019-02-23 15:20

Why not something like

public int GetOrderIndependantHashCode(IEnumerable<int> source)
{
    return (source.Select(x => x*x).Sum()
            + source.Select(x => x*x*x).Sum()
            + source.Select(x => x*x*x*x).Sum()) & 0x7FFFFF;
}
查看更多
Lonely孤独者°
7楼-- · 2019-02-23 15:25

Basically all of the approaches here are instantiations of the same template. Map x1, …, xn to f(x1) op … op f(xn), where op is a commutative associative operation on some set X, and f is a map from items to X. This template has been used a couple of times in ways that are provably good.

  • Choose a random large prime p and a random residue b in [1, p - 1]. Let f(x) = bx mod p and let op be addition. We essentially interpret a set as a polynomial and use the Schwartz–Zippel lemma to bound the probability of a collision (= the probability that a nonzero polynomial has b as a root mod p).

  • Let op be XOR and let f be a randomly chosen table. This is Zobrist hashing and minimizes in expectation the number of collisions by straightforward linear-algebraic arguments.

Modular exponentiation is slow, so don't use it. As for Zobrist hashing, with 3 million items, the table f probably won't fit into L2, though it does set an upper bound of one main-memory access.

I would instead take Zobrist hashing as a departure point and look for a cheap function f that behaves like a random function. This is essentially the job description of a non-cryptographic pseudorandom generator – I would try computing f by seeding a fast PRG with x and generating one value.

EDIT: given that the sets all have the same sums, don't choose f to be a degree 1 polynomial (e.g., the step function of a linear congruential generator).

查看更多
登录 后发表回答