Using cStringIO on another system, I wrote some unicode via:
u'content-length'.encode('utf-8')
and on reading this back using
unicode(stringio_fd.read(), 'utf-8')
I get:
u'c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00'
Printing the above in the terminal appears to give the right value (the NUL characters are invisible), but of course I can't do anything useful with it:
print unicode("c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00")
content-length
print unicode("c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00") == u'content-length'
False
What's the quickest, cheapest way to turn this string into a string equivalent to u'content-length'? I can't switch away from cStringIO.
Updates
While philhag's answer is correct, it appears the problem is:
StringIO.StringIO(u'content-type').getvalue().encode('utf-8')
'content-type'
StringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8')
u'content-type'
cStringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8')
u'c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00t\x00\x00\x00y\x00\x00\x00p\x00\x00\x00e\x00\x00\x00'
cStringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8').decode('utf-32')
u'content-type'
The root cause is that
cStringIO.StringIO(unicode_object)
produces a nonsense. The current 2.X docs on docs.python.org say
This is unhelpful and incorrect; see below. The chm version of the docs supplied with the win32 installer for CPython 2.7.2 and 2.6.6 follows that with this sentence:
This is a correct description of the behaviour (see below). The behaviour is not brilliant. I can't imagine a good reason for that sentence being removed from the web docs.
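Given that behaviour, one way to sidestep the problem is to hand cStringIO bytes rather than a unicode object. This is only a sketch, and it assumes the writing side can be changed. The name BytesBuffer and the io.BytesIO fallback are illustrative additions so the snippet also runs where cStringIO is not available:

```python
# Sketch: encode explicitly *before* the text reaches the buffer, so the
# buffer only ever holds UTF-8 bytes and never a raw unicode object.
try:
    from cStringIO import StringIO as BytesBuffer  # Python 2
except ImportError:
    from io import BytesIO as BytesBuffer  # stand-in where cStringIO is absent

text = u'content-type'

buf = BytesBuffer(text.encode('utf-8'))      # bytes in, not unicode
round_tripped = buf.read().decode('utf-8')   # decode once, on the way out

print(round_tripped == text)
```

With the encoding done up front, the reader needs a single decode('utf-8') and does not have to know the sender's endianness or unicode width.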
Behaving badly:
So in general all one needs to do is know or guess the endianness and unicode width of the sender's Python and decode the mess with UTF-(16|32)-(B|L)E.
In your case the sender is being rather Byzantine; for example, u'content-length'.encode('utf-8') is the str object 'content-length', which bears a remarkable similarity to what you started with. Also, foo.encode('utf8').decode('utf8') produces either foo or an exception.
Something along the way is encoding your values as UTF-32. Simply decode them:
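A minimal sketch of that decoding, assuming the bytes are little-endian UTF-32 with no byte-order mark, as the dump in the question suggests. The helper guess_decode and its list of candidate codecs are hypothetical, not part of any library, and its "printable ASCII" test is only a heuristic that suits short header-like strings:

```python
# Bytes as they would come back from the buffer: four bytes per character.
# Built here with encode() purely so the example is self-contained.
raw = u'content-length'.encode('utf-32-le')

# Case 1: you still have the raw bytes. Decode with the matching codec.
print(raw.decode('utf-32-le') == u'content-length')

# Case 2: the bytes were already mis-decoded as UTF-8, which yields
# u'c\x00\x00\x00o\x00\x00\x00...'. Every character is below 256, so
# latin-1 maps the text back to the original bytes, which can then be
# decoded properly.
mangled = raw.decode('utf-8')
print(mangled.encode('latin-1').decode('utf-32-le') == u'content-length')


def guess_decode(data, candidates=('utf-32-le', 'utf-32-be',
                                   'utf-16-le', 'utf-16-be')):
    """Hypothetical helper: return the first decoding that is printable ASCII.

    Assumes the expected text is plain ASCII (for example a header name).
    """
    for codec in candidates:
        try:
            text = data.decode(codec)
        except UnicodeDecodeError:
            continue  # wrong width or endianness for this data
        if all(u' ' <= ch <= u'~' for ch in text):
            return text
    raise ValueError('no candidate codec produced printable ASCII')


print(guess_decode(raw) == u'content-length')
```

If the sender's width and endianness are known, the explicit decode in case 1 or case 2 is the safer choice; the guessing helper is only for when they are not.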