Tuple vs string as a Dictionary key in C#

2019-03-23 07:37发布

问题:

I have a cache that I implement using a ConcurrentDictionary, The data that I need to keep depends on 5 parameters. So the Method to get it from the cache is: (I show only 3 parameters here for simplicity, and I changed the data type to represent CarData for clearity)

public CarData GetCarData(string carModel, string engineType, int year);

I wonder what type of key will be better to use in my ConcurrentDictionary, I can do it like this:

var carCache = new ConcurrentDictionary<string, CarData>();
// check for car key
bool exists = carCache.ContainsKey(string.Format("{0}_{1}_{2}", carModel, engineType, year);

Or like this:

var carCache = new ConcurrentDictionary<Tuple<string, string, int>, CarData>();
// check for car key
bool exists = carCache.ContainsKey(new Tuple(carModel, engineType, year));

I don't use these parameters together any other place, so there is no justification to create a class just to keep them together.

I want to know which approach is a better in terms of performance and maintainability.

回答1:

You could create a class (doesn't matter that its only used here) that overrides GetHashCode and Equals:

Thanks theDmi (and others) for improvements...

public class CarKey : IEquatable<CarKey>
{
    public CarKey(string carModel, string engineType, int year)
    {
        CarModel = carModel;
        EngineType= engineType;
        Year= year;
    }

    public string CarModel {get;}
    public string EngineType {get;}
    public int Year {get;}

    public override int GetHashCode()
    {
        unchecked // Overflow is fine, just wrap
        {
            int hash = (int) 2166136261;

            hash = (hash * 16777619) ^ CarModel?.GetHashCode() ?? 0;
            hash = (hash * 16777619) ^ EngineType?.GetHashCode() ?? 0;
            hash = (hash * 16777619) ^ Year.GetHashCode();
            return hash;
        }
    }

    public override bool Equals(object other)
    {
        if (ReferenceEquals(null, other)) return false;
        if (ReferenceEquals(this, other)) return true;
        if (other.GetType() != GetType()) return false;
        return Equals(other as CarKey);
    }

    public bool Equals(CarKey other)
    {
        if (ReferenceEquals(null, other)) return false;
        if (ReferenceEquals(this, other)) return true;
        return string.Equals(CarModel,obj.CarModel) && string.Equals(EngineType, obj.EngineType) && Year == obj.Year;
    }
}

If you don't override those, ContainsKey does a reference equals.

Note: the Tuple class does have its own equality functions that would basically do the same as above. Using a bespoke class makes it clear that is what is intended to happen - and is therefore better for maintainability. It also has the advantage that you can name the properties so it is clear

Note 2: the class is immutable as dictionary keys need to be to avoid potential bugs with hashcodes changing after the object is added to the dictionary See here

GetHashCode taken from here



回答2:

I want to know which approach is a better in terms of performance and maintainability.

As always, you have the tools to figure it out. Code both possible solutions and make them race. The one that wins is the winner, you don't need anyone here to answer this particular question.

About maintenance, the solution that autodocuments itself better and has better scalability should be the winner. In this case, the code is so trivial that autodocumentation isn't that much of an issue. From a scalability point of view, IMHO, the best solution is to use Tuple<T1, T2, ...>:

  • You get free equality semantics which you don't need to maintain.
  • Collisions are not possible, something that is not true if you choose the string concatenation solution:

    var param1 = "Hey_I'm a weird string";
    var param2 = "!"
    var param3 = 1;
    key = "Hey_I'm a weird string_!_1";
    
    var param1 = "Hey";
    var param2 = "I'm a weird string_!"
    var param3 = 1;
    key = "Hey_I'm a weird string_!_1";
    

    Yeah, far fetched, but, in theory, entirely possible and your question is precisely about unknown events in the future, so...

  • And last, but not least, the compiler helps you maintain the code. If, for example, tomorrow you have to add param4 to your key, Tuple<T1, T2, T3, T4> will strongly type your key. Your string concatenation algorithm on the other hand can live on blissfully happy generating keys without param4 and you wont know whats happening until your client calls you up because their software is not working as expected.



回答3:

If performance is really important, then the answer is that you shouldn't use either option, because both unnecessarily allocate an object on every access.

Instead, you should use a struct, either a custom one, or ValueTuple from the System.ValueTuple package:

var myCache = new ConcurrentDictionary<ValueTuple<string, string, int>, CachedData>();
bool exists = myCache.ContainsKey(ValueTuple.Create(param1, param2, param3));

C# 7.0 also contais syntax sugar to make this code easier to write (but you don't need to wait for C# 7.0 to start using ValueTuple without the sugar):

var myCache = new ConcurrentDictionary<(string, string, int), CachedData>();
bool exists = myCache.ContainsKey((param1, param2, param3));


回答4:

Implement a custom key class and make sure it is suitable for such use cases, i.e. implement IEquatable and make the class immutable:

public class CacheKey : IEquatable<CacheKey>
{
    public CacheKey(string param1, string param2, int param3)
    {
        Param1 = param1;
        Param2 = param2;
        Param3 = param3;
    }

    public string Param1 { get; }

    public string Param2 { get; }

    public int Param3 { get; }

    public bool Equals(CacheKey other)
    {
        if (ReferenceEquals(null, other)) return false;
        if (ReferenceEquals(this, other)) return true;
        return string.Equals(Param1, other.Param1) && string.Equals(Param2, other.Param2) && Param3 == other.Param3;
    }

    public override bool Equals(object obj)
    {
        if (ReferenceEquals(null, obj)) return false;
        if (ReferenceEquals(this, obj)) return true;
        if (obj.GetType() != GetType()) return false;
        return Equals((CacheKey)obj);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            var hashCode = Param1?.GetHashCode() ?? 0;
            hashCode = (hashCode * 397) ^ (Param2?.GetHashCode() ?? 0);
            hashCode = (hashCode * 397) ^ Param3;
            return hashCode;
        }
    }
}

This is a GetHashCode() implementation how Resharper generates it. It is a good general-purpose implementation. Adapt as required.


Alternatively, use something like Equ (I'm the creator of that library) that automatically generates Equals and GetHashCode implementations. This will make sure that these methods always include all members of the CacheKey class, so the code becomes much easier to maintain. Such an implementation would then simply look like this:

public class CacheKey : MemberwiseEquatable<CacheKey>
{
    public CacheKey(string param1, string param2, int param3)
    {
        Param1 = param1;
        Param2 = param2;
        Param3 = param3;
    }

    public string Param1 { get; }

    public string Param2 { get; }

    public int Param3 { get; }
}

Note: You should obviously use meaningful property names, otherwise introducing a custom class does not provide much benefit over using a Tuple.



回答5:

I wanted to compare the Tuple versus Class versus "id_id_id" approaches described in the other comments. I used this simple code:

public class Key : IEquatable<Key>
{
    public string Param1 { get; set; }
    public string Param2 { get; set; }
    public int Param3 { get; set; }

    public bool Equals(Key other)
    {
        if (ReferenceEquals(null, other)) return false;
        if (ReferenceEquals(this, other)) return true;
        return string.Equals(Param1, other.Param1) && string.Equals(Param2, other.Param2) && Param3 == other.Param3;
    }

    public override bool Equals(object obj)
    {
        if (ReferenceEquals(null, obj)) return false;
        if (ReferenceEquals(this, obj)) return true;
        if (obj.GetType() != this.GetType()) return false;
        return Equals((Key) obj);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            var hashCode = (Param1 != null ? Param1.GetHashCode() : 0);
            hashCode = (hashCode * 397) ^ (Param2 != null ? Param2.GetHashCode() : 0);
            hashCode = (hashCode * 397) ^ Param3;
            return hashCode;
        }
    }
}

static class Program
{

    static void TestClass()
    {
        var stopwatch = new Stopwatch();
        stopwatch.Start();
        var classDictionary = new Dictionary<Key, string>();

        for (var i = 0; i < 10000000; i++)
        {
            classDictionary.Add(new Key { Param1 = i.ToString(), Param2 = i.ToString(), Param3 = i }, i.ToString());
        }
        stopwatch.Stop();
        Console.WriteLine($"initialization: {stopwatch.Elapsed}");

        stopwatch.Restart();

        for (var i = 0; i < 10000000; i++)
        {
            var s = classDictionary[new Key { Param1 = i.ToString(), Param2 = i.ToString(), Param3 = i }];
        }

        stopwatch.Stop();
        Console.WriteLine($"Retrieving: {stopwatch.Elapsed}");
    }

    static void TestTuple()
    {
        var stopwatch = new Stopwatch();
        stopwatch.Start();
        var tupleDictionary = new Dictionary<Tuple<string, string, int>, string>();

        for (var i = 0; i < 10000000; i++)
        {
            tupleDictionary.Add(new Tuple<string, string, int>(i.ToString(), i.ToString(), i), i.ToString());
        }
        stopwatch.Stop();
        Console.WriteLine($"initialization: {stopwatch.Elapsed}");

        stopwatch.Restart();

        for (var i = 0; i < 10000000; i++)
        {
            var s = tupleDictionary[new Tuple<string, string, int>(i.ToString(), i.ToString(), i)];
        }

        stopwatch.Stop();
        Console.WriteLine($"Retrieving: {stopwatch.Elapsed}");
    }

    static void TestFlat()
    {
        var stopwatch = new Stopwatch();
        stopwatch.Start();
        var tupleDictionary = new Dictionary<string, string>();

        for (var i = 0; i < 10000000; i++)
        {
            tupleDictionary.Add($"{i}_{i}_{i}", i.ToString());
        }
        stopwatch.Stop();
        Console.WriteLine($"initialization: {stopwatch.Elapsed}");

        stopwatch.Restart();

        for (var i = 0; i < 10000000; i++)
        {
            var s = tupleDictionary[$"{i}_{i}_{i}"];
        }

        stopwatch.Stop();
        Console.WriteLine($"Retrieving: {stopwatch.Elapsed}");
    }

    static void Main()
    {
        TestClass();
        TestTuple();
        TestFlat();
    }
}

Results:

I ran each method 3 times in Release without debugging, each run commenting out the calls to the other methods. I took the average of the 3 runs, but there wasn't much variance anyway.

TestTuple:

initialization: 00:00:14.2512736
Retrieving: 00:00:08.1912167

TestClass:

initialization: 00:00:11.5091160
Retrieving: 00:00:05.5127963

TestFlat:

initialization: 00:00:16.3672901
Retrieving: 00:00:08.6512009

I was surprised to see that the class approach was faster than both the tuple approach and the string approach. In my opinion it's more readable and more future-safe, in the sense that more functionality can be added to the Key class (assuming it's not just a key, it represents something).



回答6:

IMHO, I prefer to use in such cases some intermediate structure (in your case it will be Tuple). Such approach creates additional layer between parameters and end-target dictionary. Of course, it will be depend on purposes. Such way for example allows you to create not trivial transition of parameters (for example container may "distort" data).



回答7:

I ran Tomer's test cases, adding ValueTuples as a test case (new c# value type). Was impressed at how well they performed.

TestClass
initialization: 00:00:11.8787245
Retrieving: 00:00:06.3609475

TestTuple
initialization: 00:00:14.6531189
Retrieving: 00:00:08.5906265

TestValueTuple
initialization: 00:00:10.8491263
Retrieving: 00:00:06.6928401

TestFlat
initialization: 00:00:16.6559780
Retrieving: 00:00:08.5257845

Code for the test is below:

static void TestValueTuple(int n = 10000000)
{
    var stopwatch = new Stopwatch();
    stopwatch.Start();
    var tupleDictionary = new Dictionary<(string, string, int), string>();

    for (var i = 0; i < n; i++)
    {
        tupleDictionary.Add((i.ToString(), i.ToString(), i), i.ToString());
    }
    stopwatch.Stop();
    Console.WriteLine($"initialization: {stopwatch.Elapsed}");

    stopwatch.Restart();

    for (var i = 0; i < n; i++)
    {
        var s = tupleDictionary[(i.ToString(), i.ToString(), i)];
    }

    stopwatch.Stop();
    Console.WriteLine($"Retrieving: {stopwatch.Elapsed}");
}