Python: storing big data structures

I'm currently doing a project in python that uses dictionaries that are relatively big (around 800 MB). I tried to store one of this dictionaries by using pickle, but got an MemoryError.

What is the proper way to save this kind of files in python? Should I use a database?

标签： python data-structures pickle

5条回答

Summer. ? 凉城

2楼-- · 2019-03-01 01:27

Perhaps you could use sqlite3? Unless you have a real old version of Python, it ought to be available: https://docs.python.org/2/library/sqlite3.html

I have not checked the limitations of sqlite3, and I have no knowledge of its usefulness in your situation, but it would be worth checking out.

0人赞添加讨论(0) 举报

▲ chillily

3楼-- · 2019-03-01 01:27

When you pickle the entire data structure, you are limited by system RAM. You can, however, do it in chunks.

streaming-pickle looks like a solution for pickling file-like objects larger than memory on board.

https://gist.github.com/hardbyte/5955010

0人赞添加讨论(0) 举报

成全新的幸福

4楼-- · 2019-03-01 01:45

Python-standard shelve module provides dict-like interface for persistent objects. It works with many database backends and is not limited by RAM. The advantage of using shelve over direct work with databases is that most of your existing code remains as it was. This comes at the cost of speed (compared to in-RAM dicts) and at the cost of flexibility (compared to working directly with databases).

0人赞添加讨论(0) 举报

Emotional °昔

5楼-- · 2019-03-01 01:46

As opposed to shelf, klepto doesn't need to store the entire dict in a single file (using a single file is very slow for read-write when you only need one entry). Also, as opposed to shelf, klepto can store almost any type of python object you can put in a dictionary (you can store functions, lambdas, class instances, sockets, multiprocessing queues, whatever).

klepto provides a dictionary abstraction for writing to a database, including treating your filesystem as a database (i.e. writing the entire dictionary to a single file, or writing each entry to it's own file). For large data, I often choose to represent the dictionary as a directory on my filesystem, and have each entry be a file. klepto also offers a variety of caching algorithms (like mru, lru, lfu, etc) to help you manage your in-memory cache, and will use the algorithm do the dump and load to the archive backend for you.

>>> from klepto.archives import dir_archive
>>> d = {'a':1, 'b':2, 'c':map, 'd':None}
>>> # map a dict to a filesystem directory
>>> demo = dir_archive('demo', d, serialized=True) 
>>> demo['a']
1
>>> demo['c']
<built-in function map>
>>> demo          
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> # is set to cache to memory, so use 'dump' to dump to the filesystem 
>>> demo.dump()
>>> del demo
>>> 
>>> demo = dir_archive('demo', {}, serialized=True)
>>> demo
dir_archive('demo', {}, cached=True)
>>> # demo is empty, load from disk
>>> demo.load()
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> demo['c']
<built-in function map>
>>>

klepto also provides the use of memory-mapped file backends, for fast read-write. There are other flags such as compression that can be used to further customize how your data is stored. It's equally easy (the same exact interface) to use a (MySQL, etc) database as a backend instead of your filesystem. You can use the flag cached=False to turn off memory caching completely, and directly read and write to and from disk or database.

>>> from klepto.archives import dir_archive
>>> # does not hold entries in memory, each entry will be stored on disk
>>> demo = dir_archive('demo', {}, serialized=True, cached=False)
>>> demo['a'] = 10
>>> demo['b'] = 20
>>> demo['c'] = min
>>> demo['d'] = [1,2,3]

Get klepto here: https://github.com/uqfoundation

0人赞添加讨论(0) 举报

▲ chillily

6楼-- · 2019-03-01 01:50

Since it is a dictionary, you can convert it to a list of key value pairs ([(k, v)]). You can then serialize each tuple into a string with whatever technology you'd like (like pickle), and store them onto a file line by line. This way, parallelizing processes, checking the file's content etc. is also easier.

There are libraries that allows you to stream with single objects, but IMO it just makes it more complicated. Just storing it line by line removes so much headache.

0人赞添加讨论(0) 举报

Python: storing big data structures

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间