I know this looks embarrassingly easy, and I guess the problem is that I just don't have a clear understanding of all this bytes-str-unicode (and encoding-decoding, speaking frankly) stuff yet.
I've been trying to get my working code to run on Python 3. The part I'm stuck with is when I parse an XML with lxml and decode a base64 string that is in that XML.
The code now works in the following manner:
I retrieve the binary data with an XPath query '.../binary/text()'. This produces a one-element list containing an lxml.etree._ElementUnicodeResult object. Then, with Python 2, I was able to do:
decoded = source.decode('base64')
and finally
output = numpy.frombuffer(decoded)
However, on Python 3 I get an error message saying
AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'decode'
This is not so surprising, because lxml.etree._ElementUnicodeResult is a subclass of str.
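You can see it for yourself (this is roughly what I checked; source is the XPath result from above):

>>> isinstance(source, str)
True
>>> source.decode
Traceback (most recent call last):
  ...
AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'decode'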
Another way would be to get a real str with the same data in it:
binary = tree.xpath('//binary')[0]
binary_string = binary.text
That would be essentially the same. So what do I do to decode it from base64? I've looked at the base64 module, but it takes a bytes object as an argument, and I can't think of a way to present str as bytes, because if I try to construct a bytes object, Python will try to encode the string, which I don't need.
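To show what I mean, here's roughly what I ran into:

>>> s = 'U3RhY2sgT3ZlcmZsb3c='
>>> bytes(s)
Traceback (most recent call last):
  ...
TypeError: string argument without an encoding
>>> bytes(s, 'ascii')
b'U3RhY2sgT3ZlcmZsb3c='

bytes(s, 'ascii') does produce bytes, but at that point I wasn't sure whether encoding was the right thing to do at all.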
Googling further, I came across the binascii module (which is invoked indirectly from base64 anyway, if I'm not mistaken), but calling binascii.b2a_base64() on my string produces
TypeError: 'str' does not support the buffer interface
P.S. I've even found an answered question on how to decode a hex string in Python 3, but this is done with a dedicated method, bytes.fromhex(), so I don't see how it would be helpful.
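For reference, that hex case looks like this, and I don't see a base64 counterpart on bytes:

>>> bytes.fromhex('537461636b')
b'Stack'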
Could someone please tell me what I'm missing? I'm afraid most of the post is irrelevant and only aggravates my shame, but at least you guys know what I tried.
I don't have Python 3 installed, but it sounds like you need to convert the Unicode returned from lxml to bytes, perhaps by calling .encode('ascii')?
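Something like this, perhaps (untested, for the reason above; the file name data.xml and the default float64 dtype are just placeholders):

import base64
import numpy
from lxml import etree

tree = etree.parse('data.xml')                      # placeholder file name
source = tree.xpath('//binary/text()')[0]           # _ElementUnicodeResult, a str subclass
decoded = base64.b64decode(source.encode('ascii'))  # str -> ASCII bytes -> raw bytes
output = numpy.frombuffer(decoded)                  # assumes float64, frombuffer's default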
OK, I think I'm going to summarize my current understanding of things (feel free to correct me). Hopefully it will help someone else out there as confused as I've been.
The credit totally goes to thebjorn and delnan, of course.
So, starting with the most common things:
there's Unicode, and it's a global standard that assigns codes (or code points) to all the exotic characters you can imagine. Those codes are just integer numbers. As of Unicode 6.1 there are 109,975 graphic characters, says Wikipedia.
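You can look those code points up right from Python 3:

>>> ord('S'), ord('я')
(83, 1103)
>>> chr(0x20ac)
'€'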
Then there are encodings that define how to represent Unicode characters as byte sequences. One byte isn't enough to designate an arbitrary Unicode char. Although, if you only take a small subset of them (the English alphabet, digits, punctuation, some control characters), you can make do with one byte per character (or even 7 bits; see ASCII).
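For example (Python 3), a Cyrillic letter doesn't fit into ASCII at all and takes two bytes in UTF-8:

>>> 'я'.encode('utf-8')
b'\xd1\x8f'
>>> 'я'.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\u044f' in position 0: ordinal not in range(128)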
To pass a Unicode string anywhere, one needs to encode it to bytes; then it can be decoded on the other end.
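The round trip looks like this (Python 3):

>>> data = 'Stack Overflow'.encode('utf-8')   # str -> bytes
>>> data.decode('utf-8')                      # bytes -> str
'Stack Overflow'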
In Python 2, str is actually bytes, and unicode is Unicode, but Python 2 will do implicit encoding/decoding for you when needed. It will try to use the ASCII encoding.
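That implicit ASCII conversion is what makes this work, and fail, on Python 2:

>>> u'Stack ' + 'Overflow'     # the str is silently decoded as ASCII
u'Stack Overflow'
>>> u'caf\xe9' + '\xe9'        # a non-ASCII byte blows up
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)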
In Python 3, str is always a Unicode string, and bytes is a new data type for actual bytes. No implicit conversion is ever done by Python 3; you always need to do it yourself and specify the encoding. That means your program won't work until you understand what's going on, which totally happened to me.
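On Python 3 the same kind of mixing is simply an error (the exact message varies between versions):

>>> 'Stack' + b'Overflow'
Traceback (most recent call last):
  ...
TypeError: can only concatenate str (not "bytes") to str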
Now, that being more or less clear, let's move on to base64 encoding, which is also an encoding of sorts, but has a slightly different meaning.
Suppose you have some binary data (i.e. bytes) that may mean anything (in my case it's a bunch of floats). Now you want to represent this binary array with a string. That's what base64 encoding means: you have your bytes represented as an ASCII string.
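For example:

>>> import base64
>>> base64.b64encode(b'Stack Overflow')
b'U3RhY2sgT3ZlcmZsb3c='

(which is exactly the string we'll decode below).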
Base64 means 6 bits, so in a base64-encoded string a single character stands for 6 bits of your data. That is why base64-encoded strings need to have a length that is a multiple of 4: 4 characters carry 24 bits, which is exactly 3 bytes, and the = padding rounds the string up to that boundary when the data length isn't a multiple of 3.
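You can see the padding at work: 1, 2 and 3 input bytes all produce 4 output characters, with = filling the gap:

>>> base64.b64encode(b'S'), base64.b64encode(b'St'), base64.b64encode(b'Sta')
(b'Uw==', b'U3Q=', b'U3Rh')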
Finally, to decode from base64 you need an ASCII string; a Unicode string won't do, since it may only contain characters from the base64 alphabet. The base64 module does the job in Python. The base64.b64decode() function takes a byte string as the argument: in Python 2 that means str, in Python 3 it means bytes. (Newer Python 3 releases also accept an ASCII-only str directly, but encoding explicitly never hurts.) So if you have a str, such as
>>> s = 'U3RhY2sgT3ZlcmZsb3c='
In Python 2 you could just do
>>> s.decode('base64')
because s is already in ASCII.
In Python 3, you need to encode it in ASCII first, so you'll have to do:
>>> base64.b64decode(s.encode('ascii'))
And by the way, this will return a bytes object, so it's really up to you how to treat those bytes then. Maybe it's my floats, but maybe you should try to decode it as ASCII :) In Python 2, however, it will be just a str. Anyway, have a look at struct for the tools to unpack your data from those bytes.
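For instance, if the decoded bytes are packed doubles, it would go roughly like this (the format '<d', i.e. one little-endian float64, is just an assumption about the data):

>>> import struct
>>> raw = base64.b64decode(b'AAAAAAAA8D8=')   # the 8 bytes of a little-endian 1.0
>>> struct.unpack('<d', raw)
(1.0,)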
So if you need the code to work on both Python 2 and 3, go with the last one. To make sure you have Unicode in the end (if you are decoding text from base64), you'll have to decode it:
>>> base64.b64decode(s.encode('ascii')).decode('ascii')
On Python 2, encode('ascii') won't effectively do anything, because it's applied to a str. Python will do an implicit conversion to Unicode first and then do what you ask (convert it back to ASCII). decode('ascii') will return a unicode object on Python 2.
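So the very same line runs on Python 2 too; note the u prefix in the result:

>>> base64.b64decode(s.encode('ascii')).decode('ascii')
u'Stack Overflow'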