HashSets don't keep the elements unique if you

Posted 2020-07-11 08:37

Question:

When working with HashSets in C#, I recently came across an annoying problem: HashSets don't guarantee uniqueness of their elements, so they are not true sets. What they do guarantee is that when Add(T item) is called, the item is not added if item.Equals(that) is true for any item already in the set. This no longer holds if you manipulate items that are already in the set. A small program that demonstrates this (copied from my LINQPad):

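void Main()
{
    HashSet<Tester> testset = new HashSet<Tester>();
    testset.Add(new Tester(1));
    testset.Add(new Tester(2));
    // Dump() is LINQPad's output helper; this prints 1 and 2.
    foreach(Tester tester in testset){
      tester.Dump();
    }
    // Mutate the elements in place so that both become equal.
    foreach(Tester tester in testset){
      tester.myint = 3;
    }
    // The set still holds two elements; this prints 3 and 3.
    foreach(Tester tester in testset){
      tester.Dump();
    }
    // Copying into a new set re-hashes the elements; this prints a single 3.
    HashSet<Tester> secondhashset = new HashSet<Tester>(testset);
    foreach(Tester tester in secondhashset){
      tester.Dump();
    }
}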

class Tester{
  public int myint;

  public Tester(int i){
    this.myint = i;
  }

  public override bool Equals(object o){
    if (o== null) return false;
    Tester that = o as Tester;
    if (that == null) return false;
    return (this.myint == that.myint);
  }

  public override int GetHashCode(){
    return this.myint;
  }

  public override string ToString(){
    return this.myint.ToString();
  }
}

It will happily let me manipulate the items in the collection until they are equal, only filtering out the duplicates when a new HashSet is built. What is advisable when I want to work with sets where I need to know the entries are unique? Roll my own, where Add(T item) adds a copy of the item and the enumerator enumerates over copies of the contained items? This presents the challenge that every contained element must be deep-copyable, at least in the members that influence its equality.
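To make the consequence concrete, here is a minimal sketch that reuses the Tester class from the question (the Program wrapper and the condensed Equals are only for illustration). It suggests that after the mutation the set still reports two elements, and that a lookup by value misses them, because each element stays filed under the hash code it had when it was added.

```csharp
using System;
using System.Collections.Generic;

class Tester
{
    public int myint;
    public Tester(int i) { myint = i; }
    public override bool Equals(object o) => o is Tester that && myint == that.myint;
    public override int GetHashCode() => myint;
}

static class Program
{
    static void Main()
    {
        var testset = new HashSet<Tester> { new Tester(1), new Tester(2) };

        // Mutate the elements in place so that both now compare equal.
        foreach (Tester tester in testset) tester.myint = 3;

        // The set is not notified of the change and still holds two elements.
        Console.WriteLine(testset.Count);                      // 2

        // The lookup uses hash code 3, but the elements were stored under 1 and 2.
        Console.WriteLine(testset.Contains(new Tester(3)));    // False

        // Rebuilding re-hashes every element, so the duplicate is dropped.
        Console.WriteLine(new HashSet<Tester>(testset).Count); // 1
    }
}
```

The failed Contains call is arguably the more serious symptom: the set not only holds duplicates, it can no longer find its own elements by value.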

Another solution would be to roll my own set that only accepts elements implementing INotifyPropertyChanged and reacts to the event by re-checking for equality, but this seems severely limiting, not to mention a whole lot of work and a performance loss under the hood.
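A rough sketch of what such a notifying set might look like follows. The names ObservableTester and RecheckingSet are hypothetical, and the strategy of rebuilding the whole inner set on every change is just one simple assumption. It also illustrates the performance concern, since every property change costs a full re-hash.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;

// Hypothetical element type: raises PropertyChanged when its identity field changes.
class ObservableTester : INotifyPropertyChanged
{
    private int _value;
    public event PropertyChangedEventHandler PropertyChanged;

    public ObservableTester(int value) { _value = value; }

    public int Value
    {
        get => _value;
        set
        {
            _value = value;
            PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(nameof(Value)));
        }
    }

    public override bool Equals(object o) => o is ObservableTester that && _value == that._value;
    public override int GetHashCode() => _value;
}

// Hypothetical wrapper: rebuilds its inner HashSet whenever an element changes.
class RecheckingSet<T> where T : INotifyPropertyChanged
{
    private HashSet<T> _items = new HashSet<T>();

    public int Count => _items.Count;

    public bool Add(T item)
    {
        if (!_items.Add(item)) return false;
        item.PropertyChanged += OnItemChanged;
        return true;
    }

    private void OnItemChanged(object sender, PropertyChangedEventArgs e)
    {
        // The stored hash codes are stale now, so re-hash everything.
        var rebuilt = new HashSet<T>();
        foreach (T item in _items)
        {
            // Elements that have become duplicates are dropped and unsubscribed.
            if (!rebuilt.Add(item)) item.PropertyChanged -= OnItemChanged;
        }
        _items = rebuilt;
    }
}

static class Program
{
    static void Main()
    {
        var a = new ObservableTester(1);
        var b = new ObservableTester(2);
        var set = new RecheckingSet<ObservableTester>();
        set.Add(a);
        set.Add(b);

        a.Value = 3;
        Console.WriteLine(set.Count); // 2, the elements still differ

        b.Value = 3;
        Console.WriteLine(set.Count); // 1, one of the now-equal elements was dropped
    }
}
```

Which of the two equal elements survives depends on enumeration order, so a real implementation would need a policy for that, which adds to the work the question already anticipates.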

Yet another possible solution I thought of is making sure that all fields are either const or readonly and set only in the constructor. All of these solutions seem to have very large drawbacks. Do I have any other options?

Answer 1:

You're really talking about object identity. If you're going to hash items they need to have some kind of identity so they can be compared.

  • If that identity can change, it is not a valid basis for identity. You currently have public int myint. It really should be readonly, and only set in the constructor.
  • If two objects are conceptually different (i.e. you want to treat them as different in your specific design) then their hash code should be different.
  • If you have two objects with the same content (i.e. two value objects that have the same field values) then they should have the same hash codes and should be equal.
  • If your data model says that you can have two objects with the same content but they can't be equal, you should use a surrogate id, not hash the contents.
  • Perhaps your objects should be immutable value types so the object can't change.
  • If they are mutable types, you should assign a surrogate ID (i.e. one that is introduced externally, like an increasing counter ID or the object's hash code) that never changes for the given object.

This is a problem with your Tester objects, not the set. You need to think hard about how you define identity. It's not an easy problem.
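The surrogate-ID suggestion could be sketched as follows. The Id property and the static counter are assumptions made for illustration; they are not part of the original Tester class. Equality and hashing depend only on the ID, so changing myint no longer affects the set.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class Tester
{
    private static int _nextId;   // hypothetical ID source: a shared increasing counter

    public int Id { get; }        // surrogate ID, assigned once and never changed
    public int myint;             // mutable payload, no longer part of the identity

    public Tester(int i)
    {
        Id = Interlocked.Increment(ref _nextId);
        myint = i;
    }

    public override bool Equals(object o) => o is Tester that && Id == that.Id;
    public override int GetHashCode() => Id;
}

static class Program
{
    static void Main()
    {
        var first = new Tester(1);
        var testset = new HashSet<Tester> { first, new Tester(2) };

        // Mutating the payload does not touch the identity.
        foreach (Tester tester in testset) tester.myint = 3;

        Console.WriteLine(testset.Count);           // 2
        Console.WriteLine(testset.Contains(first)); // True
    }
}
```

The trade-off is that two testers with the same myint are now distinct elements, which is only right if the data model treats them as different objects.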



Answer 2:

When I need a one-dimensional collection of guaranteed unique items, I usually go with Dictionary<TKey, TValue>: you cannot add elements with the same key, plus I usually need to attach some properties to the items, and the value comes in handy (my go-to value type is Tuple<> when there are many values...).

Of course, it's neither the most performant nor the least memory-hungry solution, but I don't usually have performance or memory concerns.
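A minimal sketch of the dictionary approach, assuming an int key and a Tuple<string, int> value chosen purely for illustration:

```csharp
using System;
using System.Collections.Generic;

static class Program
{
    static void Main()
    {
        // Key: the unique identity. Value: extra properties attached to the item.
        var items = new Dictionary<int, Tuple<string, int>>();
        items.Add(1, Tuple.Create("first", 10));

        // Add throws ArgumentException when the key is already present.
        try
        {
            items.Add(1, Tuple.Create("duplicate", 20));
        }
        catch (ArgumentException)
        {
            Console.WriteLine("duplicate key rejected");
        }

        // Checking first avoids the exception.
        if (!items.ContainsKey(2)) items.Add(2, Tuple.Create("second", 30));

        Console.WriteLine(items.Count); // 2
    }
}
```

The same caveat applies to the key as to HashSet elements: a key must not change in any way that affects its hash code while it is in the dictionary, so this works best with immutable keys such as int or string.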



Answer 3:

You should implement your own IEqualityComparer<T> and pass it to the constructor of the HashSet to ensure you get the equality semantics you want.
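As a sketch, a comparer for the question's Tester class might look like this. The name TesterComparer is hypothetical, and Tester is reduced here to a plain class without its own Equals and GetHashCode overrides.

```csharp
using System;
using System.Collections.Generic;

class Tester
{
    public int myint;
    public Tester(int i) { myint = i; }
}

// Hypothetical comparer: two testers are equal when their myint values match.
class TesterComparer : IEqualityComparer<Tester>
{
    public bool Equals(Tester x, Tester y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.myint == y.myint;
    }

    public int GetHashCode(Tester obj) => obj.myint;
}

static class Program
{
    static void Main()
    {
        var testset = new HashSet<Tester>(new TesterComparer());

        Console.WriteLine(testset.Add(new Tester(1))); // True
        Console.WriteLine(testset.Add(new Tester(1))); // False, rejected by the comparer
        Console.WriteLine(testset.Count);              // 1
    }
}
```

A comparer only changes how equality is decided at the time of Add and lookup. It does not by itself protect against elements being mutated after they have been added.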

And as Joe said, if you want the collection to remain unique even beyond .Add(T item) you need to use ValueObjects that are created by the constructor and have no publicly visible set attributes. i.e.
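A minimal sketch of such a value object, assuming a hypothetical class named ImmutableTester with a get-only property:

```csharp
using System;
using System.Collections.Generic;

sealed class ImmutableTester
{
    // Get-only property: assigned in the constructor, no setter at all.
    public int MyInt { get; }

    public ImmutableTester(int myInt) { MyInt = myInt; }

    public override bool Equals(object o) => o is ImmutableTester that && MyInt == that.MyInt;
    public override int GetHashCode() => MyInt;
    public override string ToString() => MyInt.ToString();
}

static class Program
{
    static void Main()
    {
        var testset = new HashSet<ImmutableTester>
        {
            new ImmutableTester(1),
            new ImmutableTester(2),
            new ImmutableTester(1) // equal to the first element, so it is not added
        };

        Console.WriteLine(testset.Count); // 2

        // foreach (var t in testset) t.MyInt = 3;  // would not compile: MyInt has no setter
    }
}
```

Because the state that feeds Equals and GetHashCode cannot change after construction, the elements cannot drift into equality while they are in the set. To "change" a value, you would remove the old element and add a new one.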