I implemented this solution for getting a hash value from vector<T>:
#include <cstddef>
#include <functional>
#include <vector>

// using boost::hash_combine
// (must be declared before the std::hash specialization below, which uses it)
template <class T>
inline void hash_combine(std::size_t& seed, T const& v)
{
    seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

namespace std
{
    template<typename T>
    struct hash<vector<T>>
    {
        typedef vector<T> argument_type;
        typedef std::size_t result_type;
        result_type operator()(argument_type const& in) const
        {
            size_t size = in.size();
            size_t seed = 0;
            for (size_t i = 0; i < size; i++)
                // Combine the hash of the current element with the hashes of the previous ones
                hash_combine(seed, in[i]);
            return seed;
        }
    };
}
But this solution doesn't scale at all: with a vector<double> of 10 million elements it takes more than 2.5 s (according to VS). Does a fast hash function exist for this scenario?
Notice that creating a hash value from the vector reference is not a feasible solution, since the related unordered_map will be used in different runs; in addition, two vector<double>s with the same content but different addresses would be mapped differently (undesired behavior for this application).
The approach that MSVC's old hash map used was to sample less often.

This means that isolated changes won't show up in your hash, but the thing you are trying to avoid is reading and processing the entire 80 MB of data in order to hash your vector. Not reading some of the elements is pretty unavoidable.
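A minimal sketch of that sampling idea, written as a drop-in replacement for the operator() in the question's specialization (the stride of roughly size/10 is an illustrative choice, not necessarily what MSVC used):

result_type operator()(argument_type const& in) const
{
    std::size_t size = in.size();
    std::size_t stride = 1 + size / 10;  // visit roughly 10 elements (illustrative)
    std::size_t seed = size;             // fold the length in, so {x} and {x, x} differ
    for (std::size_t i = 0; i < size; i += stride)
        hash_combine(seed, in[i]);
    return seed;
}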
The second thing you should do is not specialize std::hash on all vectors; this may make your program ill-formed (as suggested by a defect resolution whose status I do not recall), and at the least is a bad plan (as the std is sure to permit itself to add hash combining and hashing of vectors). When I write a custom hash, I usually use ADL (Koenig lookup) to make it easy to extend.
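Here is a minimal sketch of such an ADL-based hasher; the names my_hasher, hash_tag and my_utils::hash_impl match the description below, but the details are a reconstruction, not a definitive implementation:

#include <array>
#include <cstddef>
#include <functional>
#include <vector>

namespace my_utils {
  namespace hash_impl {
    struct hash_tag {};  // dummy argument that pulls this namespace into ADL

    namespace fallback {
      // Last resort: defer to std::hash<T>; compile-time error if there is none.
      template<class T>
      std::size_t hash(hash_tag, T const& t) {
        return std::hash<T>{}(t);
      }
    }

    // Entry point: finds free hash(hash_tag, T) overloads via ADL
    // (in hash_impl for std types, or in T's own namespace), else falls back.
    template<class T>
    std::size_t hasher(T const& t) {
      using fallback::hash;
      return hash(hash_tag{}, t);
    }

    inline std::size_t hash_combine(std::size_t seed, std::size_t h) {
      return seed ^ (h + 0x9e3779b9 + (seed << 6) + (seed >> 2));
    }

    // Sample roughly 10 elements of any random-access container.
    template<class C>
    std::size_t hash_random_access_container(C const& c) {
      std::size_t size = c.size();
      std::size_t stride = 1 + size / 10;
      std::size_t seed = hasher(size);  // include the size itself
      for (std::size_t i = 0; i < size; i += stride)
        seed = hash_combine(seed, hasher(c[i]));
      return seed;
    }

    // Overloads for std types live here; hash_tag makes ADL find them.
    template<class T, class A>
    std::size_t hash(hash_tag, std::vector<T, A> const& v) {
      return hash_random_access_container(v);
    }
    template<class T, std::size_t N>
    std::size_t hash(hash_tag, std::array<T, N> const& a) {
      return hash_random_access_container(a);
    }
  }

  struct my_hasher {
    template<class T>
    std::size_t operator()(T const& t) const {
      return hash_impl::hasher(t);
    }
  };
}

It can then be used as the hash functor of an unordered container:

std::unordered_map<std::vector<double>, int, my_utils::my_hasher> m;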
Now my_hasher is a universal hasher. To hash things, it uses either hashes declared in my_utils::hash_impl (for std types) or free functions called hash that will hash a given type. Failing that, it tries to use std::hash<T>; if that fails, you get a compile-time error. Writing a free hash function in the namespace of the type you want to hash tends to be less annoying than having to go off and open std and specialize std::hash, in my experience.

It understands vectors and arrays, recursively. Doing tuples and pairs requires a bit more work. It samples said vectors and arrays at about 10 points.

(Note: hash_tag is both a bit of a joke and a way to force ADL, and it prevents having to forward-declare the hash specializations in the hash_impl namespace, because that requirement sucks.)

The price of sampling is that you could get more collisions.
Another approach, if you have a huge amount of data, is to hash it once and keep track of when it is modified. To do this, use a copy-on-write monad interface for your type that keeps track of whether the hash is up to date. Now a vector gets hashed once; if you modify it, the hash is discarded.
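A minimal sketch of the caching half of that idea (the wrapper name and interface are mine, and the copy-on-write sharing is omitted for brevity):

#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical wrapper: caches the hash of the contents and invalidates it
// whenever the vector is mutated.
template<class T>
class cached_hash_vector
{
    std::vector<T> data_;
    mutable std::optional<std::size_t> hash_;  // empty means "stale"

    static void hash_combine(std::size_t& seed, T const& v)
    {
        seed ^= std::hash<T>{}(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }

public:
    // Reads leave the cache alone.
    T const& operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return data_.size(); }

    // All writes go through mutators that discard the cached hash.
    void set(std::size_t i, T value) { data_[i] = std::move(value); hash_.reset(); }
    void push_back(T value) { data_.push_back(std::move(value)); hash_.reset(); }

    // Rehashes at most once per modification, however often it is called.
    std::size_t hash() const
    {
        if (!hash_) {
            std::size_t seed = 0;
            for (T const& v : data_)
                hash_combine(seed, v);
            hash_ = seed;
        }
        return *hash_;
    }
};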
One can go further and have a random-access hash (where it is easy to predict, hash-wise, what happens when you edit a given value), and mediate all access to the vector. That is tricky.
You could also multi-thread the hashing, but I would guess that your code is probably memory-bandwidth bound, and multi-threading won't help much there. Worth trying.
You could use a fancier structure than a flat vector (something tree-like), where changes to the values bubble up in a hash-like way to a root hash value. This would add a lg(n) overhead to all element access. Again, you'd have to wrap the raw data up in controls that keep the hashing up to date (or keep track of which ranges are dirty and need to be updated).
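An illustrative sketch of that tree idea (the names are mine): a flat segment tree of hashes over the values, where editing one element recomputes O(lg n) internal hashes and tree_[1] always holds the root hash of the whole vector.

#include <cstddef>
#include <functional>
#include <vector>

class hashed_values
{
    std::size_t n_;
    std::vector<double> values_;
    std::vector<std::size_t> tree_;  // 1-based binary tree of combined hashes

    static std::size_t combine(std::size_t a, std::size_t b)
    {
        return a ^ (b + 0x9e3779b9 + (a << 6) + (a >> 2));
    }

public:
    explicit hashed_values(std::vector<double> v)
        : n_(v.size()), values_(std::move(v)), tree_(2 * n_)
    {
        for (std::size_t i = 0; i < n_; ++i)
            tree_[n_ + i] = std::hash<double>{}(values_[i]);     // leaves
        for (std::size_t i = n_; i-- > 1; )
            tree_[i] = combine(tree_[2 * i], tree_[2 * i + 1]);  // internal nodes
    }

    double operator[](std::size_t i) const { return values_[i]; }

    // Point update: O(lg n) hash work instead of rehashing everything.
    void set(std::size_t i, double v)
    {
        values_[i] = v;
        std::size_t node = n_ + i;
        tree_[node] = std::hash<double>{}(v);
        for (node /= 2; node >= 1; node /= 2)
            tree_[node] = combine(tree_[2 * node], tree_[2 * node + 1]);
    }

    std::size_t root_hash() const { return n_ ? tree_[1] : 0; }
};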
Finally, because you are working with 10 million elements at a time, consider moving over to a strong large-scale storage solution, like databases or what have you. Using 80 megabyte keys in a map seems strange to me.
NOTE: As per the comments, you get a 25-50x speed-up by compiling with optimizations enabled. Do that first. Then, if it's still too slow, see below.
I don't think there's much you can do. You have to touch all the elements, and that combination function is about as fast as it gets.
One option may be to parallelize the hash function. If you have 8 cores, you can run 8 threads to each hash 1/8th of the vector, then combine the 8 resulting values at the end. The synchronization overhead may be worth it for very large vectors.
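A hedged sketch of that split-and-combine scheme (the function name and the use of std::async are my choices; note that the chunking changes the resulting value relative to the sequential hash, so keep the chunk count fixed if hashes must be reproducible across runs):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

inline void hash_combine(std::size_t& seed, std::size_t h)
{
    seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

std::size_t parallel_hash(std::vector<double> const& in, unsigned num_threads = 8)
{
    std::size_t const n = in.size();
    std::size_t const chunk = (n + num_threads - 1) / num_threads;

    // Each task hashes one contiguous chunk independently.
    auto hash_chunk = [&in](std::size_t begin, std::size_t end) {
        std::size_t seed = 0;
        for (std::size_t i = begin; i < end; ++i)
            hash_combine(seed, std::hash<double>{}(in[i]));
        return seed;
    };

    std::vector<std::future<std::size_t>> parts;
    for (std::size_t begin = 0; begin < n; begin += chunk)
        parts.push_back(std::async(std::launch::async, hash_chunk,
                                   begin, std::min(begin + chunk, n)));

    // Fold the per-chunk results together, in order.
    std::size_t seed = 0;
    for (auto& f : parts)
        hash_combine(seed, f.get());
    return seed;
}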