There are many solutions to serialize a small dictionary: json.loads/json.dumps, pickle, shelve, ujson, or even by using sqlite.
But when dealing with possibly 100 GB of data, it's no longer possible to use modules like these, which may rewrite the whole data set when closing / serializing.
redis is not really an option because it uses a client/server scheme.
Question: which serverless key:value stores that can work with 100+ GB of data are frequently used in Python?
I'm looking for a solution with a standard "Pythonic" d[key] = value syntax:
import mydb
d = mydb.mydb('myfile.db')
d['hello'] = 17 # able to use string or int or float as key
d[183] = [12, 14, 24] # able to store lists as values (will probably internally jsonify it?)
d.flush() # easy to flush on disk
Note: BsdDB (BerkeleyDB) seems to be deprecated. There seems to be a LevelDB for Python, but it doesn't seem well-known - and I haven't found a version which is ready to use on Windows. Which ones would be the most common ones?
You can use sqlitedict, which provides a key-value interface to an SQLite database.
The SQLite limits page says that the theoretical maximum is 140 TB, depending on page_size and max_page_count. However, the default values for Python 3.5.2-2ubuntu0~16.04.4 (sqlite3 2.6.0) are page_size=1024 and max_page_count=1073741823. This gives ~1100 GB of maximal database size, which fits your requirement.
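You can check both values on your own build with the sqlite3 module from the standard library; a minimal sketch (the in-memory database is only used to read the defaults, which may differ from the values quoted above):

import sqlite3

con = sqlite3.connect(':memory:')
page_size = con.execute('PRAGMA page_size').fetchone()[0]
max_page_count = con.execute('PRAGMA max_page_count').fetchone()[0]
print(page_size, max_page_count)
# theoretical maximum database size for these settings
print(f'~{page_size * max_page_count / 10**9:.0f} GB')
con.close()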
You can use the package like:
from sqlitedict import SqliteDict
mydict = SqliteDict('./my_db.sqlite', autocommit=True)
mydict['some_key'] = any_picklable_object
print(mydict['some_key'])
for key, value in mydict.items():
    print(key, value)
print(len(mydict))
mydict.close()
Update
About memory usage: SQLite doesn't need your dataset to fit in RAM. By default it caches up to cache_size pages, which is barely 2 MiB (with the same Python build as above; a way to check those defaults yourself is shown after the measurements below). Here's the script you can use to check it with your data. Before running it:
pip install lipsum psutil matplotlib psrecord sqlitedict
sqlitedct.py
#!/usr/bin/env python3
import os
import random
from contextlib import closing

import lipsum
from sqlitedict import SqliteDict


def main():
    with closing(SqliteDict('./my_db.sqlite', autocommit=True)) as d:
        for _ in range(100000):
            v = lipsum.generate_paragraphs(2)[0:random.randint(200, 1000)]
            d[os.urandom(10)] = v


if __name__ == '__main__':
    main()
Run it like:
./sqlitedct.py & psrecord --plot=plot.png --interval=0.1 $!
In my case it produces this chart:
And database file:
$ du -h my_db.sqlite
84M my_db.sqlite
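To verify the cache claim on your own build, the relevant defaults can be read back with PRAGMAs; a minimal, stdlib-only sketch:

import sqlite3

# A negative cache_size means "approximately that many KiB"; a positive one
# means "that many pages" (multiply by page_size to get bytes).
con = sqlite3.connect(':memory:')
cache_size = con.execute('PRAGMA cache_size').fetchone()[0]
page_size = con.execute('PRAGMA page_size').fetchone()[0]
cache_bytes = -cache_size * 1024 if cache_size < 0 else cache_size * page_size
print(f'default page cache: ~{cache_bytes / 2**20:.1f} MiB')
con.close()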
I would consider HDF5 for this. It has several advantages:
- Usable from many programming languages.
- Usable from Python via the excellent h5py package.
- Battle tested, including with large data sets.
- Supports variable-length string values.
- Values are addressable by a filesystem-like "path" (/foo/bar).
- Values can be arrays (and usually are), but do not have to be.
- Optional built-in compression.
- Optional "chunking" to allow writing chunks incrementally.
- Does not require loading the entire data set into memory at once.
It does have some disadvantages too:
- Extremely flexible, to the point of making it hard to define a single approach.
- Complex format, not feasible to use without the official HDF5 C library (but there are many wrappers, e.g. h5py).
- Baroque C/C++ API (the Python one is not).
- Little support for concurrent writers (or writer + readers). Writes might need to lock at a coarse granularity.
You can think of HDF5 as a way to store values (scalars or N-dimensional arrays) inside a hierarchy inside a single file (or indeed multiple such files). The biggest problem with storing each value as its own file on disk would be that you'd overwhelm some filesystems; you can think of HDF5 as a filesystem within a file which won't fall down when you put a million values in one "directory."
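Here is a minimal sketch of that pattern with h5py, assuming h5py is installed; the file name kv.h5 and the path /answers/183 are arbitrary, and the values here are JSON-encoded variable-length strings (arrays are the more typical use):

import json
import h5py  # pip install h5py

with h5py.File('kv.h5', 'a') as f:
    # values are addressed by a filesystem-like path inside the file;
    # re-running requires deleting the file or the dataset first
    str_dt = h5py.string_dtype(encoding='utf-8')
    f.create_dataset('/answers/183', data=json.dumps([12, 14, 24]), dtype=str_dt)

with h5py.File('kv.h5', 'r') as f:
    raw = f['/answers/183'][()]
    if isinstance(raw, bytes):  # h5py 3.x returns bytes for string data
        raw = raw.decode('utf-8')
    print(json.loads(raw))      # [12, 14, 24]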
First, bsddb (or, under its new name, Oracle BerkeleyDB) is not deprecated.
In my experience, LevelDB / RocksDB / bsddb are slower than wiredtiger, which is why I recommend wiredtiger.
wiredtiger is the storage engine of mongodb, so it's well tested in production. There is little or no use of wiredtiger in Python outside my AjguDB project; I use wiredtiger (via AjguDB) to store and query wikidata and concepts, which is around 80 GB.
Here is an example class that mimics the python2 shelve module. Basically, it's a wiredtiger-backed dictionary where keys can only be strings:
import json
from wiredtiger import wiredtiger_open


WT_NOT_FOUND = -31803


class WTDict:
    """Create a wiredtiger backed dictionary"""

    def __init__(self, path, config='create'):
        self._cnx = wiredtiger_open(path, config)
        self._session = self._cnx.open_session()
        # define key value table
        self._session.create('table:keyvalue', 'key_format=S,value_format=S')
        self._keyvalue = self._session.open_cursor('table:keyvalue')

    def __enter__(self):
        return self

    def close(self):
        self._cnx.close()

    def __exit__(self, *args, **kwargs):
        self.close()

    def _loads(self, value):
        return json.loads(value)

    def _dumps(self, value):
        return json.dumps(value)

    def __getitem__(self, key):
        self._session.begin_transaction()
        self._keyvalue.set_key(key)
        if self._keyvalue.search() == WT_NOT_FOUND:
            raise KeyError()
        out = self._loads(self._keyvalue.get_value())
        self._session.commit_transaction()
        return out

    def __setitem__(self, key, value):
        self._session.begin_transaction()
        self._keyvalue.set_key(key)
        self._keyvalue.set_value(self._dumps(value))
        self._keyvalue.insert()
        self._session.commit_transaction()
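A short usage sketch of the class above; the home directory name wt_home is arbitrary, and it is created first because, as far as I know, wiredtiger_open() expects an existing directory:

import os

os.makedirs('wt_home', exist_ok=True)

with WTDict('wt_home') as db:
    db['hello'] = [12, 14, 24]   # values go through json.dumps
    print(db['hello'])           # [12, 14, 24]
    try:
        db['missing']
    except KeyError:
        print('no such key')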
Here is the test program from @saaj's answer, adapted for WTDict:
#!/usr/bin/env python3
import os
import random

import lipsum
from wtdict import WTDict


def main():
    with WTDict('wt') as wt:
        for _ in range(100000):
            v = lipsum.generate_paragraphs(2)[0:random.randint(200, 1000)]
            # keys must be str because the table uses key_format=S
            wt[os.urandom(10).hex()] = v


if __name__ == '__main__':
    main()
Using the following command line:
python test-wtdict.py & psrecord --plot=plot.png --interval=0.1 $!
I generated the following diagram:
$ du -h wt
60M wt
When write-ahead-log is active:
$ du -h wt
260M wt
This is without performance tuning or compression.
WiredTiger has no known practical limit; recently the documentation was updated to the following:
WiredTiger supports petabyte tables, records up to 4GB, and record numbers up to 64-bits.
http://source.wiredtiger.com/1.6.4/architecture.html