BeautifulSoup Object Will Not Pickle, Causes Inter

Posted 2019-01-20 10:04

I have a soup from BeautifulSoup that I cannot pickle. When I try to pickle the object, the Python interpreter crashes silently, so the failure cannot be caught as an exception. I need to pickle the object in order to return it through the multiprocessing package, which pickles objects to pass them between processes. How can I troubleshoot or work around the problem? Unfortunately, I cannot post the HTML for the page (it is not publicly available), and I have been unable to find a reproducible example of the problem. I tried to isolate the problem by looping over the soup and pickling individual components; the smallest thing that produces the error is a <class 'BeautifulSoup.NavigableString'>. When I print that object, it prints u'\n'.

3 answers
Ridiculous、
Reply #2 · 2019-01-20 10:46

In fact, as suggested by dekomote, you only have to take advantage of the fact that you can always convert a soup to a unicode string and then convert that string back into a soup.

So IMHO you should not try to pass soup objects through the multiprocessing package, but simply the strings representing the soups.
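The string round trip described above could look roughly like the sketch below. It assumes BeautifulSoup 4 (the bs4 package) on Python 3 with the built-in "html.parser" backend, whereas the question appears to use the older BeautifulSoup 3 on Python 2. The names worker and parse_all are made up for illustration.

```python
from multiprocessing import Pool

try:
    # Assumption: BeautifulSoup 4 is installed as the bs4 package.
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None  # the string-passing pattern below still runs


def worker(html):
    """Runs in a child process and returns markup as a plain str."""
    if BeautifulSoup is None:
        return html
    soup = BeautifulSoup(html, "html.parser")
    # ... navigate or modify the soup here ...
    return str(soup)  # a plain string pickles without trouble


def parse_all(pages, processes=2):
    """Send raw HTML to the workers, then rebuild soups in the parent."""
    with Pool(processes) as pool:
        markup = pool.map(worker, pages)  # only str objects cross processes
    if BeautifulSoup is None:
        return markup
    return [BeautifulSoup(text, "html.parser") for text in markup]
```

Calling parse_all(["<p>one</p>", "<p>two</p>"]) from under an if __name__ == "__main__": guard should give back one soup per page. Re-parsing in the parent costs some time, but no soup object ever has to be pickled.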

【Aperson】
Reply #3 · 2019-01-20 10:54

If you do not need the BeautifulSoup object itself, but only some product of the soup, e.g. a text string, you can drop BeautifulSoup attributes from your larger object before pickling by adding the following method to your class definition:

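class MyObject(object):  # your existing class

    def __getstate__(self):
        # Work on a copy so the live object keeps its soup attributes.
        state = self.__dict__.copy()
        for name, value in list(state.items()):
            # Drop any attribute whose type comes from BeautifulSoup.
            if 'BeautifulSoup' in str(type(value)):
                del state[name]

        return state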
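
One way to check a __getstate__ hook like this is to round-trip an object through pickle and see what survives. The sketch below uses only the standard library. Page and BeautifulSoupStandIn are hypothetical names, and the stand-in class merely has 'BeautifulSoup' in its type name so that the filter catches it; it is not a real soup.

```python
import pickle


class BeautifulSoupStandIn(object):
    """Hypothetical placeholder for a soup; only its type name matters."""


class Page(object):
    """Hypothetical container holding a soup plus text extracted from it."""

    def __init__(self, title, soup):
        self.title = title
        self.soup = soup

    def __getstate__(self):
        # Copy the instance dict and leave out soup-like attributes.
        state = self.__dict__.copy()
        for name, value in list(state.items()):
            if 'BeautifulSoup' in str(type(value)):
                del state[name]
        return state


page = Page("Example", BeautifulSoupStandIn())
restored = pickle.loads(pickle.dumps(page))

print(restored.title)             # the extracted text survives
print(hasattr(restored, "soup"))  # the soup was left out of the pickle
print(hasattr(page, "soup"))      # the original object is untouched
```

The restored copy has no soup attribute at all, so code that receives it from another process should rely only on the extracted values.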
手持菜刀,她持情操
Reply #4 · 2019-01-20 10:57

The class NavigableString is not serializable with pickle or cPickle, which multiprocessing uses. You should be able to serialize this class with dill, however. dill provides a superset of the pickle interface and can serialize most of Python. multiprocessing will still fail unless you use pathos.multiprocessing, a fork of multiprocessing that uses dill.

Get the code here: https://github.com/uqfoundation.


For more information see: What can multiprocessing and dill do together?

http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/

http://nbviewer.ipython.org/gist/minrk/5241793
