How to pickle and unpickle to portable string in P

Posted 2020-05-25 08:32

Question:

I need to pickle a Python3 object to a string which I want to unpickle from an environment variable in a Travis CI build. The problem is that I can't seem to find a way to pickle to a portable string (unicode) in Python3:

import os, pickle    

from my_module import MyPickleableClass


obj = {'cls': MyPickleableClass, 'other_stuf': '(...)'}

pickled = pickle.dumps(obj)

# raises TypeError: str expected, not bytes
os.environ['pickled'] = pickled

# raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb (...)
os.environ['pickled'] = pickled.decode('utf-8')

pickle.loads(os.environ['pickled'])

Is there a way to serialize complex objects like datetime.datetime to unicode or to some other string representation in Python3 which I can transfer to a different machine and deserialize?

Update

I have tested the solutions suggested by @kindall. The pickle.dumps(obj, 0).decode() approach raises a UnicodeDecodeError for my data. The base64 approach works, although it needs an extra decode/encode step, and it works on both Python2.x and Python3.x.

import codecs

# codecs.encode returns bytes, so the result is decoded to get a str
pickled = codecs.encode(pickle.dumps(obj), 'base64').decode()

type(pickled)  # <class 'str'>

unpickled = pickle.loads(codecs.decode(pickled.encode(), 'base64'))

Answer 1:

pickle.dumps() produces a bytes object. Expecting these arbitrary bytes to be valid UTF-8 text (the assumption you are making by trying to decode it to a string from UTF-8) is pretty optimistic. It'd be a coincidence if it worked!

One solution is to use the older pickling protocol (protocol 0), which is text-oriented and mostly ASCII. This still comes out as bytes, but when the output is ASCII-only it can be decoded to a string without stress:

pickled = pickle.dumps(obj, 0).decode()
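
A caveat on the one-liner above: in Python 3, protocol 0 output does not appear to be guaranteed pure ASCII (bytes objects and some non-ASCII characters can come out as raw bytes in the 0x80-0xFF range), so a bare .decode() may raise UnicodeDecodeError. A minimal sketch of one workaround, assuming you decode with latin-1 (which maps every byte value to a character) and encode the same way before loading. The payload dict is made up for illustration:

```python
import datetime
import pickle

# Hypothetical payload for illustration; any picklable object would do.
payload = {"when": datetime.datetime(2020, 5, 25, 8, 32), "note": "caf\u00e9"}

# Protocol 0 is text-oriented but may still emit bytes above 0x7F,
# so decode with latin-1, which accepts every possible byte value.
text = pickle.dumps(payload, 0).decode("latin-1")

# Reverse the same mapping before unpickling.
restored = pickle.loads(text.encode("latin-1"))
```

The resulting string contains newlines and possibly non-ASCII characters, so base64 is likely the more convenient choice for an environment variable.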

You could also use some other encoding method to encode a binary-pickled object to text, such as base64:

import codecs
pickled = codecs.encode(pickle.dumps(obj), "base64").decode()

Decoding would then be:

unpickled = pickle.loads(codecs.decode(pickled.encode(), "base64"))
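
Put together, the round trip through an environment variable might look like the sketch below. It uses base64.b64encode rather than the codecs form, on the assumption that a single-line value is preferable (the "base64" codec inserts a newline every 76 characters). The variable name PICKLED_DEMO and the helper names to_text and from_text are made up for illustration:

```python
import base64
import datetime
import os
import pickle


def to_text(obj):
    """Pickle obj and return it as a single-line ASCII string."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def from_text(text):
    """Reverse to_text. Only call this on data you trust."""
    return pickle.loads(base64.b64decode(text))


# Hypothetical payload; PICKLED_DEMO is an illustrative variable name.
payload = {"when": datetime.datetime(2020, 5, 25, 8, 32), "n": 42}
os.environ["PICKLED_DEMO"] = to_text(payload)
restored = from_text(os.environ["PICKLED_DEMO"])
```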

Using pickle with protocol 0 seems to result in shorter strings than base64-encoding binary pickles (and abarnert's suggestion of hex-encoding is going to be even larger than base64), but I haven't tested it rigorously or anything. Test it with your data and see.



Answer 2:

If you want to store bytes in the environment, instead of encoded text, that's what os.environb is for.

This doesn't work on Windows. (As the docs imply, you should check os.supports_bytes_environ if you're on 3.2+ instead of just assuming that Unix does and Windows doesn't…) So for that, you'll need to smuggle the bytes into something that can be encoded no matter what your system encoding is, e.g., using backslash-escape, or even hex. So, for example:

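if os.supports_bytes_environ:
    # os.environb needs bytes for both the key and the value
    os.environb[b'pickled'] = pickled
else:
    # codecs.encode returns bytes, so decode the hex digits to str
    os.environ['pickled'] = codecs.encode(pickled, 'hex').decode('ascii')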
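
Reading the value back needs the matching decode step. The sketch below shows only the hex path, since it should behave the same on every platform. One thing worth checking before relying on the bytes path is that environment values cannot contain NUL bytes, which binary pickles often do. PICKLED_HEX is a hypothetical variable name:

```python
import os
import pickle

pickled = pickle.dumps({"answer": 42})

# bytes.hex() yields plain ASCII text, safe for any system encoding.
os.environ["PICKLED_HEX"] = pickled.hex()

# bytes.fromhex() reverses it before unpickling.
restored = pickle.loads(bytes.fromhex(os.environ["PICKLED_HEX"]))
```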


Answer 3:

I think the simplest answer, especially if you don't care about Windows, is to just store the bytes in the environment, as suggested in my other answer.

But if you want something clean and debuggable, you might be happier using something designed as a text-based format.

pickle does have a "plain text" protocol 0, as explained in kindall's answer. It's certainly more readable than protocol 3 or 4, but it's still not something I'd actually want to read.

JSON is much nicer, but it can't handle datetime out of the box. You can come up with your own encoding (the stdlib's json module is extensible) for the handful of types you need to encode, or use something like jsonpickle. It's generally safer, more efficient, and more readable to come up with custom encodings for each type you care about than a general "pack arbitrary types in a Turing-complete protocol" scheme like pickle or jsonpickle, but of course it's also more work, especially if you have a lot of extra types.
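
As a rough illustration of the "own encoding" route, here is a sketch using the json module's default and object_hook parameters. The "__datetime__" tag and the function names encode_extra and decode_extra are conventions invented for this example, not anything the stdlib defines, and datetime.fromisoformat assumes Python 3.7 or later:

```python
import datetime
import json


def encode_extra(value):
    """Called by json.dumps for objects it cannot serialize itself."""
    if isinstance(value, datetime.datetime):
        # Tag the value so the decoder can recognize it later.
        return {"__datetime__": value.isoformat()}
    raise TypeError(f"cannot encode {type(value).__name__}")


def decode_extra(mapping):
    """Called by json.loads for every decoded JSON object."""
    if set(mapping) == {"__datetime__"}:
        return datetime.datetime.fromisoformat(mapping["__datetime__"])
    return mapping


# Hypothetical payload for illustration.
payload = {"when": datetime.datetime(2020, 5, 25, 8, 32), "other": "(...)"}
text = json.dumps(payload, default=encode_extra)
restored = json.loads(text, object_hook=decode_extra)
```

Unlike a pickle, the resulting text is plain ASCII JSON, so it can go straight into os.environ, and loading it does not execute arbitrary code.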

JSON Schema lets you define languages in JSON, similar to what you'd do in XML. It comes with a built-in date-time string format, and the jsonschema library for Python knows how to use it.

YAML has a standard extension repository that includes many types JSON doesn't, including a timestamp. Most of the zillion 'yaml' modules for Python already know how to encode datetime objects to and from this type. If you need additional types beyond what YAML includes, it was designed to be extensible declaratively. And there are libraries that do the equivalent of jsonpickle, defining new types on the fly, if you really need that.

And finally, you can always write an XML language.