How can I speed up unpickling large objects if I have plenty of RAM?

Posted 2020-02-04 07:02

It's taking me up to an hour to read a 1-gigabyte NetworkX graph data structure using cPickle (it's 1 GB when stored on disk as a binary pickle file).

Note that the file quickly loads into memory. In other words, if I run:

import cPickle as pickle

f = open("bigNetworkXGraph.pickle","rb")
binary_data = f.read() # This part doesn't take long
graph = pickle.loads(binary_data) # This takes ages

How can I speed this last operation up?

Note that I have tried pickling the data using both binary protocols (1 and 2), and it doesn't seem to make much difference which protocol I use. Also note that although I am using the "loads" (meaning "load string") function above, it is loading binary data, not ASCII data.

I have 128 GB of RAM on the system I'm using, so I'm hoping that somebody will tell me how to increase some read buffer buried in the pickle implementation.

8 answers
女痞
#2 · 2020-02-04 07:35

In general, I've found that when saving large objects to disk in Python, it's much more efficient to use numpy ndarrays or scipy.sparse matrices where possible.

Thus for huge graphs like the one in the example, I could convert the graph to a scipy sparse matrix (networkx has a function that does this, and it's not hard to write one), and then save that sparse matrix in binary format.
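A minimal sketch of that approach, assuming a recent networkx where the converter is called `to_scipy_sparse_array` (older releases call it `to_scipy_sparse_matrix`); note this keeps only the adjacency structure, so node and edge attributes would need to be saved separately:

```python
import networkx as nx
import scipy.sparse as sp

# Small stand-in for the 1 GB graph from the question.
G = nx.fast_gnp_random_graph(100, 0.1, seed=42)

# Convert to a sparse adjacency matrix and store it in scipy's
# binary .npz format, which avoids pickling a large object graph.
A = nx.to_scipy_sparse_array(G)
sp.save_npz("/tmp/graph.npz", A)

# Reload and rebuild the graph from the adjacency matrix.
A2 = sp.load_npz("/tmp/graph.npz")
H = nx.from_scipy_sparse_array(A2)
```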

戒情不戒烟
#3 · 2020-02-04 07:37

Maybe the best thing you can do is to split the big object into smaller pieces, say smaller than 50 MB each, so they can be handled comfortably in RAM, and recombine them after loading.

As far as I know, the pickle module has no way to split data automatically, so you have to do it yourself.

Anyway, another way (which is admittedly harder) is to use a NoSQL database like MongoDB to store your data...
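The splitting idea could look something like this (a Python 3, stdlib-only sketch; the `dump_chunked`/`load_chunked` helpers, the chunk size, and the file naming are all made up for illustration):

```python
import pickle

def dump_chunked(items, path_prefix, chunk_size=50_000):
    """Pickle a list in fixed-size chunks, one file per chunk."""
    n = 0
    for i in range(0, len(items), chunk_size):
        with open(f"{path_prefix}.{n}.pickle", "wb") as f:
            pickle.dump(items[i:i + chunk_size], f,
                        protocol=pickle.HIGHEST_PROTOCOL)
        n += 1
    return n  # number of chunk files written

def load_chunked(path_prefix, n_chunks):
    """Read the chunk files back and recombine them into one list."""
    items = []
    for n in range(n_chunks):
        with open(f"{path_prefix}.{n}.pickle", "rb") as f:
            items.extend(pickle.load(f))
    return items
```

For a graph you would apply this to a flat representation such as `list(G.edges())` rather than the graph object itself.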

何必那么认真
#4 · 2020-02-04 07:39

You're probably bound by Python object creation/allocation overhead, not the unpickling itself. If so, there is little you can do to speed this up, except not creating all the objects. Do you need the entire structure at once? If not, you could use lazy population of the data structure (for example: represent parts of the structure by pickled strings, then unpickle them only when they are accessed).
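A rough illustration of that lazy idea (the `LazyDict` wrapper here is hypothetical, not a library class: values are kept as pickled bytes and only unpickled the first time they are accessed):

```python
import pickle

class LazyDict:
    """Dict-like wrapper that stores values as pickled bytes and
    unpickles each value only on first access."""

    def __init__(self, pickled_values):
        self._raw = dict(pickled_values)   # key -> pickled bytes
        self._cache = {}                   # key -> unpickled object

    def __getitem__(self, key):
        if key not in self._cache:
            # Pay the object-creation cost only for keys we touch.
            self._cache[key] = pickle.loads(self._raw.pop(key))
        return self._cache[key]

# Pickle each part separately up front...
parts = {"a": [1, 2, 3], "b": {"x": 1}}
store = LazyDict({k: pickle.dumps(v) for k, v in parts.items()})
# ...then unpickle only the parts you actually use.
print(store["a"])  # → [1, 2, 3]
```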

疯言疯语
#5 · 2020-02-04 07:40

Why don't you try marshaling your data and storing it in RAM using memcached, for example? Yes, it has some limitations, but as has been pointed out elsewhere, marshaling can be much faster (reportedly 20 to 30 times) than pickling.

Of course, you should also spend as much time optimizing your data structure in order to minimize the amount and complexity of data you want stored.
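A minimal comparison sketch using only the stdlib; note that marshal handles only built-in types (dicts, lists, numbers, strings) and its format is not guaranteed stable across Python versions, so the graph has to be flattened into something like an adjacency dict first:

```python
import marshal
import pickle
import timeit

# Flatten into built-in types, since marshal cannot serialize
# arbitrary objects the way pickle can.
adj = {n: list(range(n % 50)) for n in range(10_000)}

t_pickle = timeit.timeit(
    lambda: pickle.loads(pickle.dumps(adj, pickle.HIGHEST_PROTOCOL)),
    number=20)
t_marshal = timeit.timeit(
    lambda: marshal.loads(marshal.dumps(adj)),
    number=20)
print(f"pickle:  {t_pickle:.3f}s")
print(f"marshal: {t_marshal:.3f}s")
```

The actual speedup will depend on the data shape and Python version, so it is worth timing on your own structure.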

闹够了就滚
#6 · 2020-02-04 07:42

Why don't you use pickle.load directly, instead of reading the whole file into a string first?

import cPickle as pickle

with open('fname', 'rb') as f:
    graph = pickle.load(f)
够拽才男人
#7 · 2020-02-04 07:43

I'm also trying to speed up the loading/storing of networkx graphs. I'm using the adjacency_graph method to convert the graph to something serialisable; see for instance this code:

import pickle

from networkx.generators import fast_gnp_random_graph
from networkx.readwrite import json_graph

G = fast_gnp_random_graph(4000, 0.7)

with open('/tmp/graph.pickle', 'wb+') as f:
  data = json_graph.adjacency_data(G)
  pickle.dump(data, f)

with open('/tmp/graph.pickle', 'rb') as f:
  d = pickle.load(f)
  H = json_graph.adjacency_graph(d)

However, this adjacency_graph conversion method is quite slow, so time gained in pickling is probably lost on converting.

So this actually doesn't speed things up, bummer. Running this code gives the following timings:

N=1000

    0.666s ~ generating
    0.790s ~ converting
    0.237s ~ storing
    0.295s ~ loading
    1.152s ~ converting

N=2000

    2.761s ~ generating
    3.282s ~ converting
    1.068s ~ storing
    1.105s ~ loading
    4.941s ~ converting

N=3000

    6.377s ~ generating
    7.644s ~ converting
    2.464s ~ storing
    2.393s ~ loading
    12.219s ~ converting

N=4000

    12.458s ~ generating
    19.025s ~ converting
    8.825s ~ storing
    8.921s ~ loading
    27.601s ~ converting

This superlinear growth is probably because the number of edges grows quadratically with the number of nodes (with p = 0.7, roughly 0.7·N²/2 edges). Here is a test gist, in case you want to try it yourself:

https://gist.github.com/wires/5918834712a64297d7d1
