More specific dupe of 875228—Simple data storing in Python.
I have a rather large dict (6 GB) and I need to do some processing on it. I'm trying out several document clustering methods, so I need to have the whole thing in memory at once. I have other functions to run on this data, but the contents will not change.
Currently, every time I think of a new function I have to write it and then re-generate the dict. I'm looking for a way to write this dict to a file, so that I can load it into memory instead of recalculating all its values.
To oversimplify, it looks something like: {((('word','list'),(1,2),(1,3)),(...)):0.0, ....}
I feel that Python must have a better way than my looping through a string looking for : and ( and trying to parse it into a dictionary.
I'd use shelve, json, yaml, or whatever, as suggested by other answers. shelve is especially cool because you can have the dict on disk and still use it; values will be loaded on demand.

But if you really want to parse the text of the dict, and it contains only strings, ints and tuples like you've shown, you can use ast.literal_eval to parse it. It is a lot safer, since you can't eval full expressions with it; it only works with strings, numbers, tuples, lists, dicts, booleans, and None.
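A minimal sketch of that round trip (the file name clusters.txt is just an illustration):

    import ast

    data = {(('word', 'list'), (1, 2), (1, 3)): 0.0}

    # write the dict's literal representation out once...
    with open('clusters.txt', 'w') as f:
        f.write(repr(data))

    # ...then parse it back safely, without eval
    with open('clusters.txt') as f:
        data = ast.literal_eval(f.read())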
Write it out in a serialized format, such as pickle (a Python standard-library module for serialization), or perhaps JSON (a representation that can be parsed to produce the in-memory representation again).
Why not use Python's pickle? Python has a great serializing module called pickle, and it is very easy to use:
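For example (the file name is just an illustration):

    import pickle

    data = {(('word', 'list'), (1, 2), (1, 3)): 0.0}

    # dump the dict to disk once...
    with open('clusters.pkl', 'wb') as f:
        pickle.dump(data, f)

    # ...and load it back instead of recalculating
    with open('clusters.pkl', 'rb') as f:
        data = pickle.load(f)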
There are two disadvantages with pickle: the files it produces are not human-readable, and unpickling data from an untrusted source is a security risk.
If you are using Python 2.6 or later, there is a built-in module called json. It is as easy to use as pickle:
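A sketch (note that JSON keys must be strings, so the question's tuple keys would need converting first):

    import json

    data = {'word list': 0.0}   # tuple keys are not valid JSON

    with open('clusters.json', 'w') as f:
        json.dump(data, f)

    with open('clusters.json') as f:
        data = json.load(f)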
The JSON format is human-readable and very similar to the dictionary string representation in Python, and it doesn't have pickle's security issues. It might be slower than cPickle, though.
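If speed matters, the usual Python 2 idiom is to try the faster C implementation first (Python 3 does this automatically):

    try:
        import cPickle as pickle   # C implementation, Python 2 only
    except ImportError:
        import pickle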
I would suggest that you use YAML for your file format so you can tinker with it on disk. To get it in Python, just easy_install pyyaml (see http://pyyaml.org/). It comes with easy save/load functions for files, which I can't remember right this minute.
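They are presumably yaml.safe_dump and yaml.safe_load; a minimal sketch (file name hypothetical):

    import yaml   # from the pyyaml package

    data = {'word list': 0.0}   # the question's tuple keys may not round-trip
                                # through the safe loader

    with open('clusters.yaml', 'w') as f:
        yaml.safe_dump(data, f)

    with open('clusters.yaml') as f:
        data = yaml.safe_load(f)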
This solution at SourceForge uses only standard Python modules:
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net
The compression bonus will probably reduce your 6 GB dictionary to about 1 GB. If you do not want to store a series of dictionaries, the module also contains a file.gz solution which might be more suitable given your dictionary's size.
Here are a few alternatives depending on your requirements:

- numpy stores your plain data in a compact form and performs group/mass operations well
- shelve is like a large dict backed by a file (see the sketch after this list)
- some 3rd-party storage module, e.g. stash, stores arbitrary plain data
- a proper database, e.g. mongodb for hairy data, or mysql or sqlite for plain data and faster retrieval
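A minimal sketch of the shelve option (file name hypothetical; shelve keys must be strings, so the question's tuple keys would need converting, e.g. via repr()):

    import shelve

    key = repr((('word', 'list'), (1, 2), (1, 3)))

    # build the shelf once...
    db = shelve.open('clusters.shelf')
    db[key] = 0.0
    db.close()

    # ...later, reopen it; values are read from disk on demand
    db = shelve.open('clusters.shelf')
    print(db[key])
    db.close()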