Python "\x00" filled / utf-32 string from cStringIO

Through a cStringIO object on another system, I wrote some unicode via:

u'content-length'.encode('utf-8')

and on reading this back using unicode(stringio_fd.read(), 'utf-8'), I get:

u'c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00'

Printing the above in the terminal shows the right value, but of course I can't do anything useful with it:

print unicode("c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00")

content-length

print unicode("c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00") == u'content-length'

False

What's the quickest, cheapest way to turn this string into a string equivalent to u'content-length'? I can't change from cStringIO.

Updates

While philhag's answer is correct, it appears the problem is:

StringIO.StringIO(u'content-type').getvalue().encode('utf-8')

'content-type'

StringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8')

u'content-type'

cStringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8')

u'c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00t\x00\x00\x00y\x00\x00\x00p\x00\x00\x00e\x00\x00\x00'

cStringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8').decode('utf-32')

u'content-type'

Tags: python unicode

2 answers

Root（大扎）

#2 · 2019-07-27 08:02

The root cause is that cStringIO.StringIO(unicode_object) produces nonsense.

The current 2.X docs on docs.python.org say

Unlike the StringIO module, this module is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.

This is unhelpful and incorrect; see below. The chm version of the docs supplied with the win32 installer for CPython 2.7.2 and 2.6.6 follows that with this sentence:

Calling StringIO() with a Unicode string parameter populates the object with the buffer representation of the Unicode string instead of encoding the string.

This is a correct description of the behaviour (see below). The behaviour is not brilliant. I can't imagine a good reason for that sentence being removed from the web docs.
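To make "buffer representation" concrete, here is a minimal sketch that imitates such a dump without touching cStringIO itself. It assumes that the internal buffer of a narrow build matches UTF-16 and that of a wide build matches UTF-32, both in the machine's byte order. The helper name simulate_buffer_dump is hypothetical and not part of any library.

```python
def simulate_buffer_dump(text, wide, byteorder):
    """Imitate the bytes that a raw dump of a unicode object's buffer would hold.

    Assumption: a narrow build stores 2-byte UTF-16 code units and a wide
    build stores 4-byte UTF-32 code units, both in the machine's byte order.
    """
    if byteorder not in ("little", "big"):
        raise ValueError("byteorder must be 'little' or 'big'")
    width = "32" if wide else "16"
    suffix = "le" if byteorder == "little" else "be"
    return text.encode("utf-%s-%s" % (width, suffix))


# Same text, two hypothetical senders on little-endian machines.
narrow_dump = simulate_buffer_dump(u"fubar", wide=False, byteorder="little")
wide_dump = simulate_buffer_dump(u"fubar", wide=True, byteorder="little")
print(repr(narrow_dump))
print(repr(wide_dump))
```

The wide dump has three NUL bytes after every ASCII character, which is the pattern shown in the question.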

Behaving badly:

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
>>> import StringIO, cStringIO, sys
>>> StringIO.StringIO(u"fubar").getvalue()
u'fubar' <<=== unicode object
>>> cStringIO.StringIO(u"fubar").getvalue()
'f\x00u\x00b\x00a\x00r\x00' <<=== str object
>>> cStringIO.StringIO(u"\u0405\u0406").getvalue()
'\x05\x04\x06\x04' <<=== "accepts"
>>> sys.maxunicode
65535 # your sender presumably emits 1114111 (wide unicode)
>>> sys.byteorder
'little'

So in general all one needs to do is know/guess the endianness and unicode-width of the sender's Python and decode the mess with UTF-(16|32)-(B|L)E.
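The rule above can be sketched as a small helper. This is only an illustration: decode_buffer_dump is a hypothetical name, and the caller is assumed to already know (or have guessed) the sender's unicode width and byte order.

```python
def decode_buffer_dump(raw, wide, byteorder):
    """Decode bytes that are a raw dump of a unicode buffer.

    wide      -- True if the sender's sys.maxunicode was 1114111, False if 65535
    byteorder -- the sender's sys.byteorder, 'little' or 'big'
    """
    if byteorder not in ("little", "big"):
        raise ValueError("byteorder must be 'little' or 'big'")
    width = "32" if wide else "16"
    suffix = "le" if byteorder == "little" else "be"
    return raw.decode("utf-%s-%s" % (width, suffix))


# The narrow, little-endian bytes from the session above.
print(decode_buffer_dump(b"f\x00u\x00b\x00a\x00r\x00", wide=False, byteorder="little"))
```

A wrong guess usually shows up quickly, either as a decoding error or as text full of NULs and unexpected characters.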

In your case the sender is being rather Byzantine; for example u'content-length'.encode('utf-8') is the str object 'content-length', which bears a remarkable similarity to what you started with. Also foo.encode('utf8').decode('utf8') produces either foo or an exception.


在下西门庆

#3 · 2019-07-27 08:18

Something along the way is encoding your values as UTF-32. Simply decode them:

>>> b = u"c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00\
... n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00\
... n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00"
>>> b.decode('utf-32')
u'content-length'
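If what you hold is already a text object with embedded NULs (as in the question, where the bytes were read back with the 'utf-8' codec), one way to do the same thing explicitly is to turn it back into bytes first and then decode. This is a sketch under assumptions: recover_text is a hypothetical helper, the sender is assumed to be a wide, little-endian build, and the first decode is assumed to have used UTF-8.

```python
def recover_text(mangled, codec="utf-32-le", transport="utf-8"):
    """Undo a wrong decode, then apply the codec that matches the sender.

    mangled   -- text that was decoded with `transport` although the bytes
                 were really a raw unicode buffer dump
    codec     -- assumed layout of the sender's buffer (wide, little-endian)
    transport -- codec used for the first, wrong decode (assumed UTF-8)
    """
    raw = mangled.encode(transport)  # back to the original bytes
    return raw.decode(codec)         # decode them the way they were laid out


# Rebuild the question's NUL-filled value, then repair it.
mangled = u"content-length".encode("utf-32-le").decode("utf-8")
print(recover_text(mangled))
```

Naming 'utf-32-le' explicitly is a deliberate choice. As far as I know, plain 'utf-32' on input without a byte order mark falls back to the decoding machine's own byte order, which need not match the sender's.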
