Python “\x00” filled / utf-32 string from cStringIO

Posted 2019-07-27 07:50

Through cStringIO of another system, I wrote some unicode via:

u'content-length'.encode('utf-8')

and on reading this back using unicode(stringio_fd.read(), 'utf-8'), I get:

u'c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00'

Printing the above in the terminal shows the right value, but of course I can't do anything useful with it, because the embedded NULs make every comparison fail:

print unicode("c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00")

content-length

print unicode("c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00") == u'content-length'

False

What's the quickest, cheapest way to turn this string into a string equivalent to u'content-length'? I can't change from cStringIO.


Updates

While philhag's answer is correct, it appears the problem is:

StringIO.StringIO(u'content-type').getvalue().encode('utf-8')

'content-type'

StringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8')

u'content-type'

cStringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8')

u'c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00t\x00\x00\x00y\x00\x00\x00p\x00\x00\x00e\x00\x00\x00'

cStringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8').decode('utf-32')

u'content-type'

2 Answers
Root(大扎)
#2 · 2019-07-27 08:02

The root cause is that cStringIO.StringIO(unicode_object) produces nonsense.

The current 2.X docs on docs.python.org say

Unlike the StringIO module, this module is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.

This is unhelpful and incorrect; see below. The chm version of the docs supplied with the win32 installer for CPython 2.7.2 and 2.6.6 follows that with this sentence:

Calling StringIO() with a Unicode string parameter populates the object with the buffer representation of the Unicode string instead of encoding the string.

This is a correct description of the behaviour (see below). The behaviour is not brilliant. I can't imagine a good reason for that sentence being removed from the web docs.
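As a rough illustration of what "buffer representation" means here: for BMP characters, the raw memory of a Python 2 unicode object looks like UTF-16 on a narrow build and UTF-32 on a wide build, in the machine's native byte order and without a BOM. The sketch below only emulates those bytes with ordinary codecs; `raw_buffer` is a hypothetical helper name, not part of cStringIO.

```python
import sys


def raw_buffer(text, wide, byteorder=sys.byteorder):
    """Approximate the bytes cStringIO would expose for a unicode object.

    wide=False models a narrow build (sys.maxunicode == 65535),
    wide=True models a wide build (sys.maxunicode == 1114111).
    This is an emulation for illustration, not what cStringIO calls.
    """
    codec = "utf-32" if wide else "utf-16"
    suffix = "-le" if byteorder == "little" else "-be"
    # The explicit -le/-be codecs never emit a BOM, like raw memory.
    return text.encode(codec + suffix)


narrow = raw_buffer(u"fubar", wide=False, byteorder="little")
wide = raw_buffer(u"fubar", wide=True, byteorder="little")
print(repr(narrow))  # two bytes per character
print(repr(wide))    # four bytes per character, as in the question
```

The three NULs after every character in the question therefore point at a wide, little-endian sender.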

Behaving badly:

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
>>> import StringIO, cStringIO, sys
>>> StringIO.StringIO(u"fubar").getvalue()
u'fubar' <<=== unicode object
>>> cStringIO.StringIO(u"fubar").getvalue()
'f\x00u\x00b\x00a\x00r\x00' <<=== str object
>>> cStringIO.StringIO(u"\u0405\u0406").getvalue()
'\x05\x04\x06\x04' <<=== "accepts"
>>> sys.maxunicode
65535 # your sender presumably emits 1114111 (wide unicode)
>>> sys.byteorder
'little'

So in general all one needs to do is know/guess the endianness and unicode-width of the sender's Python and decode the mess with UTF-(16|32)-(B|L)E.
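If the sender's build is unknown, the guessing can be sketched roughly as follows. The function names and the "looks like plain ASCII" check are assumptions made for this example (they suit header names such as content-length); a different payload would need a different plausibility test.

```python
def is_plain_ascii(text):
    # Assumption: the payload is printable ASCII, e.g. an HTTP header name.
    return all(u" " <= ch <= u"~" for ch in text)


def decode_raw_unicode_buffer(data, looks_ok=is_plain_ascii):
    """Try each width/byte-order combination on a byte string.

    Returns (text, codec) for the first candidate that decodes cleanly
    and passes the plausibility check; raises ValueError otherwise.
    """
    for codec in ("utf-32-le", "utf-32-be", "utf-16-le", "utf-16-be"):
        try:
            text = data.decode(codec)
        except UnicodeDecodeError:
            continue  # wrong width or byte order for this data
        if looks_ok(text):
            return text, codec
    raise ValueError("data does not look like a raw unicode buffer")


garbled = u"content-length".encode("utf-32-le")  # stands in for the sender's bytes
text, codec = decode_raw_unicode_buffer(garbled)
print(text, codec)
```

The plausibility check matters because a wrong UTF-16 byte order often still decodes without error, just to the wrong characters.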

In your case the sender is being rather Byzantine; for example u'content-length'.encode('utf-8') is the str object 'content-length', which bears a remarkable similarity to what you started with. Also foo.encode('utf8').decode('utf8') produces either foo or an exception.
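If the sending side can be touched at all, one possible fix that still keeps cStringIO is to encode before the text reaches the buffer, so that it holds real UTF-8 bytes instead of raw memory. This is only a sketch: `make_buffer` is a hypothetical helper, and the `io.BytesIO` fallback is an assumption added so the snippet also runs where cStringIO does not exist.

```python
try:
    from cStringIO import StringIO as ByteBuffer  # Python 2
except ImportError:
    from io import BytesIO as ByteBuffer  # stand-in on Python 3 (assumption)


def make_buffer(text, encoding="utf-8"):
    """Encode explicitly, then hand the resulting bytes to the buffer."""
    return ByteBuffer(text.encode(encoding))


buf = make_buffer(u"content-length")
round_tripped = buf.read().decode("utf-8")
print(round_tripped == u"content-length")
```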

在下西门庆
#3 · 2019-07-27 08:18

Something along the way is encoding your values as UTF-32. Simply decode them:

>>> b = u"c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00\
... n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00\
... n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00"
>>> b.decode('utf-32')
u'content-length'
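A slightly more defensive variant names the byte order explicitly, so the result does not depend on a BOM or on the receiving machine. The sketch assumes a little-endian, wide-unicode sender, as the bytes in the question suggest; `undo_raw_utf32` is a hypothetical helper for the case where the data has already been passed through unicode(..., 'utf-8'), and it assumes an ASCII-only payload.

```python
# Raw bytes as read from the buffer, before any decoding.
received = (b"c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00"
            b"n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00"
            b"n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00")

header = received.decode("utf-32-le")  # explicit byte order, no BOM expected


def undo_raw_utf32(text, codec="utf-32-le"):
    """Repair a unicode string that still contains the NUL padding.

    latin-1 maps code points 0-255 back to identical byte values, which
    recovers the original bytes when the payload was ASCII-only.
    """
    return text.encode("latin-1").decode(codec)


print(header == u"content-length")
```

After this, comparisons such as header == u'content-length' behave as expected.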