I would like to store a mapping of an integer key to a float value in-memory.
I have roughly 130 million keys (and, accordingly, 130 million values).
My focus is on lookup performance -- I have to do many, many millions of lookups.
The C++ STL library has a map class for associative arrays of this sort. I have several questions about map.
What is the storage overhead of map for a dataset of the size mentioned above? How does storage overhead scale, in general, with map?
It looks like the underlying data structure for map is a red-black tree, a self-balancing binary search tree. It sounds like the real-world performance for this is O(log n) for insertion and retrieval.
It mentions O(1) for a hinted insertion. My input is pre-sorted, so I believe I should be able to provide a hint for insertion events. How would I provide this hint, using the methods listed here?
Is there an STL container that provides better lookup performance?
Are there other publicly-available, open-source frameworks with an associative array class that uses an underlying data structure that would perform better than STL map?
If writing my own container class would provide better lookup performance, what data structures might I research?
I am using GCC 4 for this task, running under either Linux or Mac OS X.
I apologize in advance if these are dumb questions. Thank you for your advice.
If your keys are unchanging, you might consider a perfect hash function as an alternative to a standard container.
I don't know what obstacles you'll run into with a dataset of that size, but it might be worth spending a few minutes experimenting.
You may have a look at std::tr1::unordered_map.
But if you have 32 bit unsigned integer keys (4294967296 possible values) and 130 million different keys, then you should write your own container optimized for this task. Especially if the 130 million key case is the usual case (and not only a rare maximum).
4294967296 / 130000000 ≈ 33, so roughly every 33rd number in the whole key space is used in your data.
You could, for example, divide your key range into fixed-size partitions. If the keys are fairly evenly distributed, you could partition the key space into, e.g., 256-key buckets, or even 32-key buckets, depending on how much storage you are willing to waste when only a few values in a bucket are actually stored.
Example, to give you an idea:
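A minimal sketch of the fixed-partition idea, with a hypothetical 256-key bucket size. Buckets are allocated lazily, and in this toy version an absent key simply reads back as 0.0f:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The 32-bit key space is split into 2^24 buckets of 256 keys each.
// A bucket is a plain float array indexed directly by the low 8 bits
// of the key, so lookup is O(1): one shift, one pointer dereference,
// one array index.
class PartitionedMap {
public:
    PartitionedMap() : buckets_(1u << 24, (float*)0) {}
    ~PartitionedMap() {
        for (std::size_t i = 0; i < buckets_.size(); ++i)
            delete[] buckets_[i];
    }
    void put(unsigned key, float value) {
        float*& b = buckets_[key >> 8];
        if (!b) b = new float[256]();  // zero-initialized on first use
        b[key & 0xFF] = value;
    }
    float get(unsigned key) const {
        const float* b = buckets_[key >> 8];
        return b ? b[key & 0xFF] : 0.0f;  // 0.0f doubles as "absent"
    }
private:
    std::vector<float*> buckets_;  // null pointer = bucket not allocated
};
```

Note the pointer table alone costs 2^24 pointers (128 MB on a 64-bit machine), and each touched bucket stores all 256 slots whether used or not; shrinking the bucket size trades that waste against a larger pointer table.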