Create Hash Value on a List?

Posted 2020-03-10 05:50

Question:

I have a List<MyRichObject> with 50 instances in it. Each instance has 1 or 2 unique properties, but in a sense they are all unique because each one occupies exactly one position in the list, etc.

I would like to come up with a unique way to "hash" this List so it is unique from all of the other Lists. Is there a smart way to do that in .NET 4?

The purpose is to create a kind of "moniker" for the Lists so they can be dumped into a queue and found later based on their unique value.

Thanks.

Answer 1:

TL;DR

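// Requires "using System.Linq;" and must be declared in a static class.
public static int GetSequenceHashCode<T>(this IList<T> sequence)
{
    const int seed = 487;
    const int modifier = 31;

    // unchecked: let the arithmetic wrap on overflow instead of throwing.
    unchecked
    {
        // Multiply-then-add makes the result depend on item order.
        // Null items contribute 0 rather than throwing.
        return sequence.Aggregate(seed, (current, item) =>
            (current * modifier) + (item == null ? 0 : item.GetHashCode()));
    }
}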

Why bother with another answer?

The accepted answer, which XORs the item hash codes together, can give dangerously inaccurate results if the list contains multiple items with the same hash code, or the same items in a different order. For example, consider these inputs:

var a = new []{ "foo" };
var b = new []{ "foo", "bar" };
var c = new []{ "foo", "bar", "spam" };
var d = new []{ "seenoevil", "hearnoevil", "speaknoevil" };

These all produce different results suggesting they are all unique collections. Great! Now let's try with a duplicate:

var e = new []{ "foo", "bar", "spam" };

The accepted answer's GetSequenceHashCode should produce the same result for both c and e, and it does. So far so good. Now let's try with items out of sequence:

var f = new []{ "spam", "bar", "foo" };

Uh oh... the accepted answer's GetSequenceHashCode indicates that f is equal to both c and e, which it is not. Why is this happening? Break it down into the actual hash code values first, using c as an example:

int hashC = "foo".GetHashCode() ^ 
            "bar".GetHashCode() ^ 
            "spam".GetHashCode();

The exact numbers aren't important here, so for a clearer demonstration let's pretend the hash codes of the three strings are foo=8, bar=16 and spam=32. So:

int hashC = 8 ^ 16 ^ 32;

or to break it down into binary representation:

8 ^ 16 ^ 32 == 56;

//  8 = 00001000
//  ^
// 16 = 00010000
//  ^
// 32 = 00100000
//  =
// 56 = 00111000

Now you should see why the order of items in the list is overlooked by this implementation, i.e. 8^16^32 = 16^8^32 = 32^16^8 etc.
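To see the same effect in code, here is a small sketch that folds the pretend hash codes 8, 16 and 32 with XOR, the way the accepted answer does. The names XorDemo and XorAll are made up for this illustration and are not part of either answer.

```csharp
using System.Linq;

static class XorDemo
{
    // Folds the given hash codes together with XOR, mirroring the
    // accepted answer's approach (0 is the identity value for XOR).
    public static int XorAll(params int[] hashCodes)
    {
        return hashCodes.Aggregate(0, (total, next) => total ^ next);
    }
}
```

XorDemo.XorAll(8, 16, 32), XorDemo.XorAll(16, 8, 32) and XorDemo.XorAll(32, 16, 8) all return 56, so every reordering of the same items collides.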

Secondly there's an issue with duplicates. Even if you assume that having the same contents in a different sequence is OK (which is not an approach I would encourage), I don't think anyone will argue the below behaviour is desirable. Let's try variations with duplicates within each list.

var a = new []{ "foo", "bar", "spam" };
var b = new []{ "foo", "bar", "spam", "foo" };
var c = new []{ "foo", "bar", "spam", "foo", "foo" };
var d = new []{ "foo", "bar", "spam", "foo", "foo", "spam", "foo", "spam", "foo" };

While a and b generate different sequence hashes, the accepted answer's GetSequenceHashCode suggests that a, c and d are all the same. Why?

If you XOR a number with itself you essentially cancel it out, i.e.

8 ^ 8 == 0;

//  8 = 00001000
//  ^
//  8 = 00001000
//  =
//  0 = 00000000

XOR by the same number again gives you the original result, i.e.

8 ^ 8 ^ 8 == 8;

//  8 = 00001000
//  ^
//  8 = 00001000
//  ^
//  8 = 00001000
//  =
//  8 = 00001000

So if we look at a and c again, substituting the simplified hash codes:

var a = new []{ 8, 16, 32 };
var c = new []{ 8, 16, 32, 8, 8 };

the hash codes are calculated as:

int hashA = 8 ^ 16 ^ 32;         // = 56
int hashC = 8 ^ 16 ^ 32 ^ 8 ^ 8; // = 56
                       // ↑   ↑ 
                       // these two cancel each other out

and likewise with d where each pair of foo and spam cancels itself out.
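For contrast, here is a rough sketch of how the multiply-and-add fold from the TL;DR method treats the same pretend hash codes. It works on raw int values rather than on list items so that 8, 16 and 32 can be plugged in directly. OrderSensitiveDemo and Combine are illustrative names only.

```csharp
using System.Linq;

static class OrderSensitiveDemo
{
    // Same multiply-and-add fold as the TL;DR method, but applied to
    // raw hash codes so the pretend values 8, 16 and 32 can be used.
    public static int Combine(params int[] hashCodes)
    {
        const int seed = 487;
        const int modifier = 31;

        // unchecked: overflow wraps around instead of throwing.
        unchecked
        {
            return hashCodes.Aggregate(seed, (current, code) =>
                (current * modifier) + code);
        }
    }
}
```

Here Combine(8, 16, 32), Combine(32, 16, 8) and Combine(8, 16, 32, 8, 8) give three different values, because each step multiplies the running total before adding the next code, so both position and repetition affect the result. This is still a 32-bit hash: two different lists can collide, just far less systematically than with XOR.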



Answer 2:

Does the hash have to be representative of the list's contents? In other words will you use the hash to determine potential equality? If not then just create a new Guid and use that.

If the identifier does need to represent the contents of the list then you can either generate a hash code based on the contents of the list (this will be inefficient, as you will be unable to cache this value because the list's contents may change) or forgo the hash altogether and use Enumerable.SequenceEqual to determine equality.
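As a minimal sketch of both options, assuming the lists only need to be found again later: the first class hands out a Guid per list and looks lists up by it, and the helper compares contents with Enumerable.SequenceEqual. ListRegistry, Register, Find and SameContents are hypothetical names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical registry: hands out a Guid "moniker" per list and
// finds the list again by that moniker.
class ListRegistry<T>
{
    private readonly Dictionary<Guid, List<T>> lists =
        new Dictionary<Guid, List<T>>();

    // Stores the list and returns the identifier to look it up with.
    public Guid Register(List<T> list)
    {
        Guid id = Guid.NewGuid();
        lists.Add(id, list);
        return id;
    }

    // Returns the list registered under id, or null if there is none.
    public List<T> Find(Guid id)
    {
        List<T> found;
        return lists.TryGetValue(id, out found) ? found : null;
    }
}

static class ListComparison
{
    // Content equality without any hashing: same items, same order.
    public static bool SameContents<T>(IEnumerable<T> first, IEnumerable<T> second)
    {
        return first.SequenceEqual(second);
    }
}
```

The Guid identifies the list instance, not its contents, so it stays valid even if items are added or removed later. SequenceEqual uses the default equality comparer for T unless you pass your own.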


Here is an example of how I would implement getting a hash code for a List<T>. First of all, if you are going to get a hash code for a particular object, you really ought to make sure that object will not change. If that object does change then your hash code is no longer any good.

The best way to work with a list that can be "frozen" (meaning no items added or removed after a certain point) is to call AsReadOnly. This will give you a ReadOnlyCollection<T>. The implementation below hinges on a ReadOnlyCollection<T> just to be safe so keep that in mind:

using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Linq;

class Example
{
    static void Main()
    {
        var seqOne = new List<int> { 1, 2, 3, 4, 5, 6 };
        var seqTwo = new List<int> { 6, 5, 4, 3, 2, 1 };

        var seqOneCode = seqOne.AsReadOnly().GetSequenceHashCode();
        var seqTwoCode = seqTwo.AsReadOnly().GetSequenceHashCode();

        // Prints True: XOR ignores item order, so the reversed list
        // gets the same code.
        Console.WriteLine(seqOneCode == seqTwoCode);
    }
}

static class Extensions
{
    public static int GetSequenceHashCode<T>(this ReadOnlyCollection<T> sequence)
    {
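        // XOR the item hash codes together. Seeding with 0 (the XOR
        // identity) avoids an InvalidOperationException on an empty list.
        return sequence
            .Select(item => item.GetHashCode())
            .Aggregate(0, (total, nextCode) => total ^ nextCode);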
    }
}
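One caveat about freezing, sketched here under the assumption that other code may still hold a reference to the original list: AsReadOnly returns a wrapper around that list rather than a copy, so later changes to the list show through the wrapper. Copying the items first gives a true snapshot. Freezing and Snapshot are hypothetical names for this illustration.

```csharp
using System.Collections.Generic;
using System.Collections.ObjectModel;

static class Freezing
{
    // Copies the items first, so later changes to the source list
    // cannot reach the returned read-only collection.
    public static ReadOnlyCollection<T> Snapshot<T>(IEnumerable<T> source)
    {
        return new List<T>(source).AsReadOnly();
    }
}
```

Given a List<int> holding 1, 2, 3, calling list.AsReadOnly() and Freezing.Snapshot(list) and then list.Add(4) leaves the wrapper with 4 items and the snapshot with 3. Only the snapshot is safe to hash once and cache.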

Oh, one last thing - make sure that your MyRichObject type has a good GetHashCode implementation itself, otherwise the hash code for the list may yield a lot of false positives upon comparison.
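As a rough illustration of what a reasonable GetHashCode could look like: the question never shows the members of MyRichObject, so the Id and Name properties below are placeholders standing in for its "unique properties". The sketch avoids newer helpers so it stays within what .NET 4 offers.

```csharp
using System;

// Hypothetical shape for MyRichObject: Id and Name are placeholders,
// not members taken from the question.
sealed class MyRichObject : IEquatable<MyRichObject>
{
    private readonly int id;
    private readonly string name;

    public MyRichObject(int id, string name)
    {
        this.id = id;
        this.name = name;
    }

    public int Id { get { return id; } }
    public string Name { get { return name; } }

    public bool Equals(MyRichObject other)
    {
        return other != null && id == other.id && name == other.name;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as MyRichObject);
    }

    // Combines the same fields that Equals compares, so equal objects
    // always produce equal hash codes.
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = (hash * 31) + id;
            hash = (hash * 31) + (name == null ? 0 : name.GetHashCode());
            return hash;
        }
    }
}
```

The fields are read-only on purpose: if a property that feeds GetHashCode could change after the object was hashed, any list hash computed from it would silently go stale.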