I have .NET and C++ implementations of a perf test function that does 854,750 lookups in a dictionary using string keys from a pool of 6838 keys. I wrote these functions to investigate a perf bottleneck in a real app.
The .NET implementation is written in F#, uses Dictionary, and is compiled for .NET 4.0.
The C++ implementation uses std::unordered_map and is built with VS2010 in Release mode.
On my machine the .NET code runs in 240 ms on average and the C++ code runs in 630 ms. Could you please help me understand the reason for this huge difference in speed?
If I make the key length in the C++ implementation shorter and use the "key_" prefix instead of "key_prefix_", it runs in 140 ms.
Another trick I tried was to replace std::string with a custom immutable string implementation that holds a const char* pointer to the source and a one-time computed hash. Using this string brought the C++ implementation down to 190 ms.
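Roughly, the idea behind that custom key type is sketched below. This is a simplified illustration of the approach, not the exact code that produced the 190 ms number:

// needs <cstring> for strlen/memcmp
struct CachedHashKey
{
    // Points at externally owned characters and caches a hash computed once;
    // the pointed-to buffer must outlive any map entry that stores this key.
    const char* Chars;
    size_t Length;
    size_t Hash;

    explicit CachedHashKey(const char* chars)
        : Chars(chars), Length(std::strlen(chars))
    {
        size_t h = 0;
        for (size_t i = 0; i < Length; ++i)
            h = h * 31 + static_cast<unsigned char>(chars[i]);
        Hash = h;
    }

    bool operator==(const CachedHashKey& other) const
    {
        return Length == other.Length &&
               std::memcmp(Chars, other.Chars, Length) == 0;
    }
};

struct CachedHashKeyHasher
{
    size_t operator()(const CachedHashKey& key) const { return key.Hash; }
};

// used as: std::unordered_map<CachedHashKey, SomeData, CachedHashKeyHasher>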
C++ code:
struct SomeData
{
public:
    float Value;
};

typedef std::string KeyString;
typedef std::unordered_map<KeyString, SomeData> DictionaryT;

const int MaxNumberOfRuns = 125;
const int MaxNumberOfKeys = 6838;

DictionaryT dictionary;
dictionary.rehash(MaxNumberOfKeys);

auto timer = Stopwatch::StartNew();
int lookupCount = 0;

char keyBuffer[100] = "key_prefix_";
size_t keyPrefixLen = std::strlen(keyBuffer);

/// run MaxNumberOfRuns * MaxNumberOfKeys iterations
for(int runId = 0; runId < MaxNumberOfRuns; runId++)
{
    for(int keyId = 0; keyId < MaxNumberOfKeys; keyId++)
    {
        /// get a new key from the pool of MaxNumberOfKeys keys
        int randomKeySuffix = (std::rand() % MaxNumberOfKeys);
        ::itoa(randomKeySuffix, keyBuffer + keyPrefixLen, 10);
        KeyString key = keyBuffer;

        /// lookup key in the dictionary
        auto dataIter = dictionary.find(key);
        SomeData* data;
        if(dataIter != dictionary.end())
        {
            /// get existing value
            data = &dataIter->second;
        }
        else
        {
            /// add a new value
            data = &dictionary.insert(dataIter, DictionaryT::value_type(key, SomeData()))->second;
        }

        /// update corresponding value in the dictionary
        data->Value += keyId * runId;
        lookupCount++;
    }
}

timer.Stop();

std::cout << "Time: " << timer.GetElapsedMilleseconds() << " ms" << std::endl;
std::cout << "Lookup count: " << lookupCount << std::endl;
Prints:
Time: 636 ms
Lookup count: 854750
F# code:
open System
open System.Diagnostics
open System.Collections.Generic
type SomeData =
    struct
        val mutable Value : float
    end
let dictionary = new Dictionary<string, SomeData>()
let randomGen = new Random()
let MaxNumberOfRuns = 125
let MaxNumberOfKeys = 6838
let timer = Stopwatch.StartNew()
let mutable lookupCount = 0
/// run MaxNumberOfRuns * MaxNumberOfKeys iterations
for runId in 1 .. MaxNumberOfRuns do
    for keyId in 1 .. MaxNumberOfKeys do
        /// get a new key from the pool of MaxNumberOfKeys keys
        let randomKeySuffix = randomGen.Next(0, MaxNumberOfKeys).ToString()
        let key = "key_prefix_" + randomKeySuffix
        /// lookup key in the dictionary
        let mutable found, someData = dictionary.TryGetValue (key)
        if not(found) then
            /// add a new value
            someData <- new SomeData()
            dictionary.[key] <- someData
        /// update corresponding value in the dictionary
        someData.Value <- someData.Value + float(keyId) * float(runId)
        lookupCount <- lookupCount + 1
timer.Stop()
printfn "Time: %d ms" timer.ElapsedMilliseconds
printfn "Lookup count: %d" lookupCount
Prints:
Time: 245 ms
Lookup count: 854750
Visual Studio 2010 uses a performant hash function for std::string, rather than an accurate one. Basically, once the key string reaches 10 characters, the hash function stops using every character and samples with a stride greater than 1:

- size() >= 10: use every second character after the first
- size() >= 20: use every third character after the first

Thanks to this, collisions happen more frequently, which of course slows the code down. Try a custom hash function for the C++ version.
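For example, a drop-in hasher that feeds every character through an FNV-1a loop could look roughly like this (using the KeyString and SomeData types from the question; other full-string hashes would work just as well):

struct FullStringHash
{
    size_t operator()(const KeyString& key) const
    {
        // FNV-1a over every character, so long keys sharing a prefix
        // still get well-distributed hash values.
        size_t hash = 2166136261u;
        for (size_t i = 0; i < key.size(); ++i)
        {
            hash ^= static_cast<unsigned char>(key[i]);
            hash *= 16777619u;
        }
        return hash;
    }
};

typedef std::unordered_map<KeyString, SomeData, FullStringHash> DictionaryT;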
We can only speculate about why one version is faster than the other; you could definitely use a profiler here to tell you where the hot spot is, so don't take any of this as a definitive answer.
Your note about the C++ version being faster with a shorter key length is illuminating, because it could point to a couple of things (hashing cost or allocation cost). Off the cuff, here are some wild observations based on my experience with unordered_map (though I'm more familiar with the Boost implementation than Microsoft's):
In this example there's no reason to use std::string as the key type; just use the integer value. This would presumably make both the C++ and F# versions faster.
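A sketch of that variant, reusing SomeData and the constants from the question (this assumes the integer suffix alone identifies an entry):

typedef std::unordered_map<int, SomeData> IntDictionaryT;

IntDictionaryT intDictionary;
intDictionary.rehash(MaxNumberOfKeys);

for (int runId = 0; runId < MaxNumberOfRuns; runId++)
{
    for (int keyId = 0; keyId < MaxNumberOfKeys; keyId++)
    {
        int randomKeySuffix = std::rand() % MaxNumberOfKeys;
        // no string formatting, string hashing, or allocation per lookup
        intDictionary[randomKeySuffix].Value += keyId * runId;
    }
}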
When you insert values into the map, it's probably not faster to do a find followed by an insert, since both will require re-hashing the key string. Just use the [] operator, which does a find-or-insert operation on its own. (I guess this would depend on how often you find a hit in the map versus adding a new value.)
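For instance, the find/insert pair in the question's loop could collapse to something like this (the key is hashed only once either way):

KeyString key = keyBuffer;
// find-or-insert in one step; a missing key gets a default-constructed SomeData
SomeData& data = dictionary[key];
data.Value += keyId * runId;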
If allocation is the bottleneck and you must use a string key type, you might get better performance out of storing a shared_ptr to the string rather than copying the string when you insert it into the map.
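One way to set that up is to key the map on a shared_ptr and give unordered_map hash and equality functors that look through the pointer; a sketch (needs <memory> and <functional>):

struct SharedStringHash
{
    size_t operator()(const std::shared_ptr<const std::string>& s) const
    {
        return std::hash<std::string>()(*s);
    }
};

struct SharedStringEqual
{
    bool operator()(const std::shared_ptr<const std::string>& a,
                    const std::shared_ptr<const std::string>& b) const
    {
        // compare by contents, not by pointer identity
        return *a == *b;
    }
};

typedef std::unordered_map<std::shared_ptr<const std::string>, SomeData,
                           SharedStringHash, SharedStringEqual> SharedKeyDictionaryT;

Note that the lookup side still has to build or reuse a shared_ptr, so this only pays off if the keys are pooled up front rather than rebuilt on every probe.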
Try providing your own hash function for the key type that ignores the "key_prefix_" part of the string.
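For example (this assumes every key really does start with the 11-character "key_prefix_"):

struct SuffixOnlyHash
{
    size_t operator()(const KeyString& key) const
    {
        // skip the shared "key_prefix_" part; it carries no distinguishing information
        const size_t PrefixLen = 11; // strlen("key_prefix_")
        size_t hash = 2166136261u;
        for (size_t i = (key.size() > PrefixLen ? PrefixLen : 0); i < key.size(); ++i)
        {
            hash ^= static_cast<unsigned char>(key[i]);
            hash *= 16777619u;
        }
        return hash;
    }
};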
Try Boost's implementation; maybe it's faster.
Again, a profile run would quickly tell you where to look for this kind of problem. Specifically, it will tell you if there's a bottleneck in hashing vs. a bottleneck in allocation.
When you're dealing with pure data-structure code, a speed ratio of 2.6 is not that strange. Take a look at the slides on this project to see what I mean.