Fastest way to save and load a large dictionary in Python

Posted 2020-05-26 09:48

I have a relatively large dictionary. How do I know its size? When I save it with cPickle, the resulting file is approximately 400 MB. cPickle is supposed to be much faster than pickle, but loading and saving this file still takes a lot of time. I have a dual-core 2.6 GHz laptop with 4 GB RAM running Linux. Does anyone have suggestions for faster saving and loading of dictionaries in Python? Thanks.

6 answers
一夜七次
#2 · 2020-05-26 10:04

That is a lot of data... What kind of contents does your dictionary hold? If it contains only primitive or fixed datatypes, maybe a real database or a custom file format is the better option?

男人必须洒脱
#3 · 2020-05-26 10:09

Use the protocol=2 option of cPickle. The default protocol (0) is much slower, and produces much larger files on disk.
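For example, a minimal sketch of the difference (in Python 3 the `pickle` module replaces `cPickle`, and the file name here is arbitrary):

```python
import pickle  # on Python 2, use cPickle instead

# Stand-in for the large dictionary from the question.
data = {i: str(i) for i in range(1000)}

# Protocol 2 is a compact binary format; the old default (protocol 0)
# is ASCII-based, slower, and produces much larger files.
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=2)

with open('data.pkl', 'rb') as f:
    loaded = pickle.load(f)
```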

If you just want to work with a larger dictionary than memory can hold, the shelve module is a good quick-and-dirty solution. It acts like an in-memory dict, but stores itself on disk rather than in memory. shelve is based on cPickle, so be sure to set your protocol to anything other than 0.
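A quick sketch of shelve usage (the file name and keys here are made up for illustration):

```python
import shelve

# A shelf behaves like a dict but persists its values to disk,
# pickling each value. Passing protocol=2 avoids the slow default.
with shelve.open('cache_db', protocol=2) as db:
    db['answer'] = 42
    db['items'] = [1, 2, 3]

# Reopening the shelf reads entries back from disk on demand,
# so the whole store never has to fit in memory at once.
with shelve.open('cache_db') as db:
    answer = db['answer']
    items = db['items']
```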

The advantages of a database like sqlite over cPickle will depend on your use case. How often will you write data? How many times do you expect to read each datum that you write? Will you ever want to perform a search of the data you write, or load it one piece at a time?

If you're doing write-once, read-many, and loading one piece at a time, by all means use a database. If you're doing write once, read once, cPickle (with any protocol other than the default protocol=0) will be hard to beat. If you just want a large, persistent dict, use shelve.

Rolldiameter
#4 · 2020-05-26 10:10

I have tried this in many projects and concluded that shelve is faster than pickle at saving data; both perform the same at loading. Shelve really is a dirty solution, because you have to be very careful with it: if you do not close a shelve file after opening it, or if your code is interrupted somewhere between opening and closing it, the shelve file has a high chance of getting corrupted (resulting in frustrating KeyErrors). That is really annoying, given that those of us using shelve are storing LARGE dict files that clearly also took a long time to construct. That is why shelve is a dirty solution... It's still faster though. So!

Emotional °昔
#5 · 2020-05-26 10:17

You could try compressing your dictionary (with some restrictions; see this post). It will be effective if disk access is the bottleneck.
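One common way to do this is to wrap the pickle stream in gzip, trading some CPU time for smaller disk I/O (a sketch; the file name and sample data are made up):

```python
import gzip
import pickle

# Stand-in for the large dictionary.
data = {i: 'value %d' % i for i in range(1000)}

# gzip.open yields a binary file object, so pickle can write to it
# directly; the stream is compressed transparently on the way to disk.
with gzip.open('data.pkl.gz', 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

with gzip.open('data.pkl.gz', 'rb') as f:
    restored = pickle.load(f)
```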

啃猪蹄的小仙女
#6 · 2020-05-26 10:19

SQLite

It might be worthwhile to store the data in a SQLite database. Although there will be some development overhead when refactoring your program to work with SQLite, querying the data also becomes much easier and more performant.

You also get transactions, atomicity, serialization, etc. for free.

The sqlite3 module ships with the Python standard library (since Python 2.5), so you almost certainly already have it built in.
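A minimal key-value sketch of this approach (table name, file name, and sample data are all made up; values are pickled so arbitrary Python objects can be stored as BLOBs):

```python
import pickle
import sqlite3

data = {'alpha': 1, 'beta': 2}

conn = sqlite3.connect('store.db')  # or ':memory:' for a throwaway database
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)')

# Using the connection as a context manager wraps the writes in a
# transaction: committed on success, rolled back on error.
with conn:
    conn.executemany(
        'INSERT OR REPLACE INTO kv VALUES (?, ?)',
        ((k, pickle.dumps(v, protocol=2)) for k, v in data.items()),
    )

# Load a single entry without reading the whole store into memory.
row = conn.execute('SELECT value FROM kv WHERE key = ?', ('beta',)).fetchone()
beta = pickle.loads(row[0])
conn.close()
```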

狗以群分
#7 · 2020-05-26 10:23

I know it's an old question, but as an update for those still looking for an answer: the protocol argument has been updated in Python 3, and there are now even faster and more efficient options (i.e. protocol=3 and protocol=4) which do not work under Python 2. You can read more about it in the reference.

In order to always use the best protocol supported by the Python version you're using, you can simply use pickle.HIGHEST_PROTOCOL. The following example is taken from the reference:

import pickle
# ...
with open('data.pickle', 'wb') as f:
    # Pickle the 'data' dictionary using the highest protocol available.
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)