I need a faster way to store and access around 3GB of k:v pairs, where k is a string or an integer and v is an np.array() that can be of different shapes.

Is there any object that is faster than the standard Python dict at storing and accessing such a table? For example, a pandas.DataFrame?

As far as I have understood, the Python dict is a quite fast implementation of a hash table. Is there anything better than that for my specific case?
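For concreteness, a minimal sketch of the kind of table I mean (the keys and shapes here are made up):

import numpy as np

# Hypothetical layout: string or integer keys mapped to arrays of different shapes
table = {
    "sensor_a": np.zeros((100, 3)),    # 2-D float array
    42:         np.arange(10),         # 1-D int array
    "image_7":  np.ones((64, 64, 3)),  # 3-D array
}

v = table["sensor_a"]  # the lookup I want to make as fast as possible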
No, there is nothing faster than a dictionary for this task, and that's because the complexity of its indexing, and even its membership checking, is approximately O(1).
Once you have saved your items in a dictionary, you can access them in constant time, so the problem is not the indexing process. But you might be able to make the process slightly faster by making some changes to your objects and their types, which may trigger some optimizations in the under-the-hood operations. For example, if your strings (keys) are not very large, you can intern them, so that they are cached in memory rather than created as new objects. If the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer comparison instead of a string comparison. That makes access to the objects much faster. Python provides an intern() function in the sys module that you can use for this purpose.
Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup...
Here is an example:
In [48]: import sys
In [49]: d = {'mystr{}'.format(i): i for i in range(30)}
In [50]: %timeit d['mystr25']
10000000 loops, best of 3: 46.9 ns per loop
In [51]: d = {sys.intern('mystr{}'.format(i)): i for i in range(30)}
In [52]: %timeit d['mystr25']
10000000 loops, best of 3: 38.8 ns per loop
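Note that the lookup key 'mystr25' in this benchmark is a string literal, which CPython typically interns automatically at compile time. For keys built at runtime you would intern the lookup key yourself; a minimal sketch:

import sys

key = 'mystr' + str(25)     # built at runtime, so not interned automatically
value = d[sys.intern(key)]  # interned lookup key -> pointer comparison after hashing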
No, I don't think there is anything faster than dict. The time complexity of its indexing is O(1).
------------------------------------------------------
| Operation    | Average Case | Amortized Worst Case |
------------------------------------------------------
| Copy[2]      | O(n)         | O(n)                 |
| Get Item     | O(1)         | O(n)                 |
| Set Item[1]  | O(1)         | O(n)                 |
| Delete Item  | O(1)         | O(n)                 |
| Iteration[2] | O(n)         | O(n)                 |
------------------------------------------------------
PS https://wiki.python.org/moin/TimeComplexity
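If you want to see for yourself that dict lookups stay constant-time as the table grows, here is a quick sketch (the table sizes are arbitrary):

import timeit

# Rough check that dict lookup time does not grow with the number of keys
for n in (10**3, 10**5, 10**6):
    d = {i: i for i in range(n)}
    total = timeit.timeit('d[n - 1]', globals={'d': d, 'n': n}, number=10**6)
    print('{:>9,} keys: {:.1f} ns per lookup'.format(n, total * 1000))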
You could think about storing them in a data structure like a trie, given that your keys are strings. Even to store and retrieve from a trie you need O(N), where N is the length of the key. The same happens with the hash calculation that computes a hash for the key: the hash is used to find and store entries in the hash table, and we often don't account for this hashing time.
You may give a trie a shot. It should offer almost equal performance, maybe a little better, depending on how the hash value is computed, for example

HASH[i] = (HASH[i-1] + key[i-1] * 256^i % BUCKET_SIZE) % BUCKET_SIZE

or something similar; the 256^i factor spreads the characters out to reduce collisions. A sketch of this hash follows.
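As an illustration, here is a minimal sketch of that kind of polynomial rolling hash; the bucket count below is an arbitrary choice of mine, not something from the answer:

BUCKET_SIZE = 1_000_003  # arbitrary prime number of buckets (an assumption)

def rolling_hash(key: str) -> int:
    # Implements HASH[i] = (HASH[i-1] + key[i-1] * 256^i) % BUCKET_SIZE
    h = 0
    power = 256 % BUCKET_SIZE      # 256^1 weights the first character
    for ch in key:
        h = (h + ord(ch) * power) % BUCKET_SIZE
        power = (power * 256) % BUCKET_SIZE
    return h

Computing this hash touches every character of the key, which is the O(N) cost mentioned above.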
You can try storing them in a trie and see how it performs.
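For reference, a minimal trie sketch mapping string keys to numpy arrays; the class and method names are mine, not a standard library API:

import numpy as np

class Trie:
    # Minimal character trie mapping string keys to values (e.g. numpy arrays).
    def __init__(self):
        self.children = {}    # one child node per character
        self.value = None
        self.has_value = False

    def set(self, key, value):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.value = value
        node.has_value = True

    def get(self, key):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                raise KeyError(key)
        if not node.has_value:
            raise KeyError(key)
        return node.value

t = Trie()
t.set('mystr25', np.arange(10))
print(t.get('mystr25'))  # -> [0 1 2 ... 9]

Note that each node here uses a small dict for its children, so in pure Python this is unlikely to beat a single flat dict; it mainly shows the O(N)-per-key access pattern.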