A recent talk about unordered_map
in C++ made me realize, that I should use unordered_map
for most cases where I used map
before, because of the efficiency of lookup ( amortized O(1) vs. O(log n) ). Most times I use a map I use either int
's or std::strings
as keys, hence I've got no problems with the definition of the hash function. The more I thought about it, the more I came to realize that I can't find any reason of using a std::map
in case of simple types over a unordered_map
-- I took a look at the interfaces, and didn't find any significant differences that would impact my code.
Hence the question - is there any real reason to use std::map
over unordered map
in case of simple types like int
and std::string
?
I'm asking from a strictly programming point of view -- I know that it's not fully considered standard, and that it may pose problems with porting.
Also I expect that one of the correct answers might be "it's more efficient for smaller sets of data" because of a smaller overhead (is that true?) -- hence I'd like to restrict the question to cases where the amount of keys is non-trivial (>1 024).
Edit: duh, I forgot the obvious (thanks GMan!) -- yes, map's are ordered of course -- I know that, and am looking for other reasons.
I was intrigued by the answer from @Jerry Coffin, which suggested that the ordered map would exhibit performance increases on long strings, after some experimentation (which can be downloaded from pastebin), I've found that this only seems to hold true for collections of random strings, when the map is initialised with a sorted dictionary (which contain words with considerable amounts of prefix-overlap), this rule breaks down, presumably because of the increased tree depth necessary to retrieve value. The results are shown below, the 1st number column is insert time, 2nd is fetch time.
Hash tables have higher constants than common map implementations, which become significant for small containers. Max size is 10, 100, or maybe even 1,000 or more? Constants are the same as ever, but O(log n) is close to O(k). (Remember logarithmic complexity is still really good.)
What makes a good hash function depends on your data's characteristics; so if I don't plan on looking at a custom hash function (but can certainly change my mind later, and easily since I typedef damn near everything) and even though defaults are chosen to perform decently for many data sources, I find the ordered nature of map to be enough of a help initially that I still default to map rather than a hash table in that case.
Plus that way you don't have to even think about writing a hash function for other (usually UDT) types, and just write op< (which you want anyway).
In most languages, unordered map (aka hash based dictionaries) are the default map however in C++ you get ordered map as default map. How did that happen? Some people erroneously assume that C++ committee made this decision in their unique wisdom but the truth is unfortunately uglier than that.
It is widely believed that C++ ended up with ordered map as default because there are not too many parameters on how they can be implemented. On the other hand, hash based implementations has tons of things to talk about. So to avoid gridlocks in standardization they just got along with ordered map. Around 2005, many languages already had good implementations of hash based implementation and so it was more easier for the committee to accept new
std::unordered_map
. In a perfect world,std::map
would have been unordered and we would havestd::ordered_map
as separate type.Below two graphs should speak for themselves (source):
Summary
Assuming ordering is not important:
std::unordered_map
std::map
. This is because reads on it areO(log n)
.std::map
is good option.std::unordered_map
.Significant differences that have not really been adequately mentioned here:
map
keeps iterators to all elements stable, in C++17 you can even move elements from onemap
to the other without invalidating invalidating iterators to them (and if properly implemented without any potential allocation).map
timings for single operations are typically more consistent, since they never need large allocations.unordered_map
usingstd::hash
as implemented in the libstdc++ is vulnerable to DoS if fed with untrusted input (it uses MurmurHash2 with a constant seed - not that seeding would really help, see https://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/).Don't forget the
map
's keep their elements ordered. If you can't give up that, obviously you can't use anunordered_map
.Something else to keep in mind is that
unordered_map
's generally use more memory. Amap
just has a few house-keeping pointers then memory for each object. Contrarily,unordered_map
's have a big array (these can get quite big in some implementations) and then additional memory for each object. If you need to be memory-aware, amap
should prove better, because it lacks the large array.So, if you need pure lookup-retrieval, I'd say an
unordered_map
is the way to go. But there are always trade-offs, and if you can't afford them, then you can't use it.Just from personal experience, I found an enormous improvement in performance (measured, of course) when using an
unordered_map
instead of amap
in a main entity look-up table.On the other hand, I found it was much slower at repeatedly inserting and removing elements. It's great for a relatively static collection of elements, but if you're doing tons of insertions and deletions the hashing + bucketing seems to add up. (Note, this was over many iterations.)
I've made a test recently which makes 50000 merge&sort. That means if the string keys are the same, merge the byte string. And the final output should be sorted. So this includes a look up for every insertion.
For the
map
implementation, it takes 200 ms to finish the job. For theunordered_map
+map
, it takes 70 ms forunordered_map
insertion and 80 ms formap
insertion. So the hybrid implementation is 50 ms faster.We should think twice before we use the
map
. If you only need the data to be sorted in the final result of your program, a hybrid solution may be better.