Python: (Pathos) Multiprocessing vs. class methods

Posted 2020-03-05 02:09

Question:

I am trying to parallelize a code using class methods via multiprocessing. The basic structure is the following:

# from multiprocessing import Pool
from pathos.multiprocessing import ProcessingPool as Pool

class myclass(object):
    def __init__(self):
        pass  # some code
    def mymethod(self):
        # more code
        return another_instance_of_myclass



def myfunc(myinstance,args):
    #some code   
    test=myinstance.mymethod()
    #more code
    return myresult #not an instance,just a number

p=Pool()

result = p.map(myfunc,listwithdata)

After this failed with the standard multiprocessing module, I became aware of the issues between pickle and multiprocessing, so I tried to solve it with pathos.multiprocessing. However, I am still getting

PicklingError: Can't pickle <type 'SwigPyObject'>: it's not found as __builtin__.SwigPyObject

together with lots of errors from pickle.py. Apart from this practical problem, I don't quite understand why anything but the final result of myfunc is being pickled at all.

Answer 1:

pathos uses dill, and dill serializes classes differently than python's pickle module does. pickle serializes classes by reference. dill (by default) serializes classes directly, and only optionally by reference.

>>> import dill
>>> 
>>> class Foo(object):
...   def __init__(self, x):
...     self.x = x
...   def bar(self, y):
...     return self.x + y * z
...   z = 1
... 
>>> f = Foo(2)
>>> 
>>> dill.dumps(f)  # the dill default, explicitly serialize a class
'\x80\x02cdill.dill\n_create_type\nq\x00(cdill.dill\n_load_type\nq\x01U\x08TypeTypeq\x02\x85q\x03Rq\x04U\x03Fooq\x05h\x01U\nObjectTypeq\x06\x85q\x07Rq\x08\x85q\t}q\n(U\r__slotnames__q\x0b]q\x0cU\n__module__q\rU\x08__main__q\x0eU\x03barq\x0fcdill.dill\n_create_function\nq\x10(cdill.dill\n_unmarshal\nq\x11Uyc\x02\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00C\x00\x00\x00s\x0f\x00\x00\x00|\x00\x00j\x00\x00|\x01\x00t\x01\x00\x14\x17S(\x01\x00\x00\x00N(\x02\x00\x00\x00t\x01\x00\x00\x00xt\x01\x00\x00\x00z(\x02\x00\x00\x00t\x04\x00\x00\x00selft\x01\x00\x00\x00y(\x00\x00\x00\x00(\x00\x00\x00\x00s\x07\x00\x00\x00<stdin>t\x03\x00\x00\x00bar\x04\x00\x00\x00s\x02\x00\x00\x00\x00\x01q\x12\x85q\x13Rq\x14c__builtin__\n__main__\nh\x0fNN}q\x15tq\x16Rq\x17U\x01zq\x18K\x01U\x07__doc__q\x19NU\x08__init__q\x1ah\x10(h\x11Uuc\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00C\x00\x00\x00s\r\x00\x00\x00|\x01\x00|\x00\x00_\x00\x00d\x00\x00S(\x01\x00\x00\x00N(\x01\x00\x00\x00t\x01\x00\x00\x00x(\x02\x00\x00\x00t\x04\x00\x00\x00selfR\x00\x00\x00\x00(\x00\x00\x00\x00(\x00\x00\x00\x00s\x07\x00\x00\x00<stdin>t\x08\x00\x00\x00__init__\x02\x00\x00\x00s\x02\x00\x00\x00\x00\x01q\x1b\x85q\x1cRq\x1dc__builtin__\n__main__\nh\x1aNN}q\x1etq\x1fRq utq!Rq")\x81q#}q$U\x01xq%K\x02sb.'
>>> dill.dumps(f, byref=True)  # the pickle default, serialize by reference
'\x80\x02c__main__\nFoo\nq\x00)\x81q\x01}q\x02U\x01xq\x03K\x02sb.'

Not serializing by reference is much more flexible. However, in rare circumstances, working with references is better (as appears to be the case when pickling something built on a SwigPyObject).
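To make "by reference" concrete, here is a minimal sketch using only the standard library's pickle on Python 3 (the session above is Python 2 output). The choice of fractions.Fraction is just an assumption for illustration; any importable class should behave the same way. The point is that the pickled bytes hold only the module name and class name, and loading looks the class up again by that name.

```python
import pickle
from fractions import Fraction  # stand-in class, chosen only for illustration

# Pickling a class with the standard library stores only where to find it
# (module name + class name), not its methods or attributes.
payload = pickle.dumps(Fraction)
print(payload)

# Both names appear in the byte stream...
has_module_name = b"fractions" in payload
has_class_name = b"Fraction" in payload

# ...and loading re-imports the module and looks the class up by name,
# so the result is the very same class object, not a rebuilt copy.
restored = pickle.loads(payload)
same_object = restored is Fraction
print(has_module_name, has_class_name, same_object)
```

This is why a by-reference pickle only works when the receiving process can import the same class, whereas dill's default carries the class definition along with the instance.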

I have been meaning (for ~2 years) to expose the byref flag to the dump call inside of pathos, but have not done so yet. It should be a fairly simple edit to do so. I've just added a ticket to do so: https://github.com/uqfoundation/pathos/issues/58. While I'm at it, it should also be easy to open up replacement of the dump and load functions that pathos uses… that way you could use customized serializers (i.e. extend those that dill provides, or use some other serializer).



Answer 2:

In multiprocessing, function serialization is needed for interprocess communication. Pickle does a poor job for this purpose; install dill via pip instead. Details (with a nice Star Trek example) can be found here: http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/
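On the question of why anything besides the return value is pickled: a pool has to ship the function and every item of the input list to the worker processes, and ship each result back, and all of that travels as serialized bytes. The sketch below mimics that step with the standard library's pickle only (not dill or pathos). The threading.Lock is an assumed stand-in for an object that wraps a native handle, such as a SwigPyObject; it is not what the asker's code actually contains.

```python
import pickle
import threading

def roundtrip(obj):
    """Serialize and deserialize obj, as happens when it crosses a process boundary."""
    return pickle.loads(pickle.dumps(obj))

# A task made of plain data survives the trip to a worker unchanged.
plain_task = (3, 4)
plain_ok = roundtrip(plain_task) == plain_task

# A task that carries a native handle fails while being serialized,
# before any worker gets a chance to run the function.
handle_task = (3, threading.Lock())  # Lock is a stand-in for a SWIG-wrapped object
try:
    pickle.dumps(handle_task)
    handle_error = None
except (TypeError, pickle.PicklingError) as exc:
    handle_error = exc

print(plain_ok)
print(type(handle_error).__name__, handle_error)
```

If the instances in listwithdata hold SWIG objects, one option consistent with this picture is to pass only plain data to the workers and construct the instances inside myfunc.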