Following this Python example, I encode a string as Base64 with:
>>> import base64
>>> encoded = base64.b64encode(b'data to be encoded')
>>> encoded
b'ZGF0YSB0byBiZSBlbmNvZGVk'
But if I leave out the leading b:
>>> encoded = base64.b64encode('data to be encoded')
I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\base64.py", line 56, in b64encode
    raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str
Why is this?
If the data to be encoded contains "exotic" characters, I think you have to encode it as UTF-8.
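A minimal sketch of why the encoding choice matters for non-ASCII text. The sample string is an assumption chosen for illustration:

```python
import base64

text = "caf\u00e9"  # hypothetical sample: "cafe" with an accented e

# ASCII cannot represent the accented character, so this step fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent any str, so its bytes can be passed to b64encode.
encoded = base64.b64encode(text.encode("utf-8"))
print(ascii_ok, encoded)
```

When decoding, the same codec has to be used to turn the bytes back into text.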
This is all you need:
The leading b makes your string a bytes literal (binary data rather than text). What version of Python do you use? 2.x or 3.x?
Edit: See http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit for the gory details of strings in Python 3.x
If the string is Unicode, the easiest way is:
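One possible way to do this is a full round trip from str to Base64 text and back. The helper names `b64_text` and `unb64_text` are hypothetical, and UTF-8 is assumed as the text encoding:

```python
import base64

def b64_text(text: str) -> str:
    """Encode text as UTF-8, Base64 the bytes, and return the result as str."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def unb64_text(encoded: str) -> str:
    """Reverse of b64_text: Base64-decode, then interpret the bytes as UTF-8."""
    return base64.b64decode(encoded).decode("utf-8")

sample = "complex string: \u00f1\u00e1\u00e9\u00ed\u00f3\u00fa"
print(unb64_text(b64_text(sample)) == sample)  # True
```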
Short Answer
You need to pass a bytes-like object (bytes, bytearray, etc.) to the base64.b64encode() function. Here are two ways: pass a bytes literal directly, or call .encode() on a str variable first.
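The two ways could look roughly like this. It is a sketch that reuses the question's sample string, and the variable names are illustrative only:

```python
import base64

# Way 1: pass a bytes literal directly.
from_literal = base64.b64encode(b'data to be encoded')

# Way 2: start from a str variable and convert it to bytes first.
string = 'data to be encoded'
from_variable = base64.b64encode(string.encode())  # .encode() defaults to UTF-8

# A bytearray is also a bytes-like object, so it is accepted as well.
from_bytearray = base64.b64encode(bytearray(b'data to be encoded'))

print(from_literal)  # b'ZGF0YSB0byBiZSBlbmNvZGVk'
```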
Why?
In Python 3, str objects are not C-style character arrays (so they are not byte arrays); rather, they are data structures that do not have any inherent encoding. You can encode that string (or interpret it) in a variety of ways. The most common (and the default in Python 3) is UTF-8, especially since it is backwards compatible with ASCII (as are most widely used encodings). That is what is happening when you take a string and call the .encode() method on it: Python is interpreting the string in UTF-8 (the default encoding) and providing you with the array of bytes that it corresponds to.
Base-64 Encoding in Python 3
Originally the question title asked about Base-64 encoding. Read on for Base-64 stuff.
base64 encoding takes 6-bit binary chunks and encodes them using the characters A-Z, a-z, 0-9, '+', '/', and '=' (some encodings use different characters in place of '+' and '/'). This is a character encoding based on the mathematical construct of the radix-64 (base-64) number system, but the two are very different. Base-64 in math is a number system like binary or decimal, and you do this change of radix on the entire number, or (if the radix you're converting from is a power of 2 less than 64) in chunks from right to left.
In base64 encoding, the translation is done from left to right; those first 64 characters are why it is called base64 encoding. The 65th symbol, '=', is used for padding, since the encoding pulls 6-bit chunks but the data it is usually meant to encode are 8-bit bytes, so sometimes there are only two or four bits in the last chunk.
Example:
If you interpret that binary data as a single integer, then this is how you would convert it to base-10 and base-64 (table for base-64):
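A sketch of that conversion, again assuming the sample is b'test'. The helper `to_radix64` is hypothetical and simply reuses the standard Base64 alphabet as the digit table for radix 64:

```python
import string

# Standard Base64 alphabet: index 0 is 'A', index 63 is '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

data = b'test'  # assumed sample
number = int.from_bytes(data, "big")  # treat all 32 bits as one integer
print(number)  # 1952805748

def to_radix64(n: int) -> str:
    """Write a non-negative integer as a positional base-64 numeral."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, 64)  # peel off 6 bits from the right
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))

print(to_radix64(number))  # B0ZXN0
```

Because the digits are peeled off from the least significant end, the 6-bit groups are aligned to the right of the number, which is the opposite of what base64 encoding does.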
base64 encoding, however, re-groups this data into 6-bit chunks starting from the left, padding the final chunk with zero bits.
So, 'B0ZXN0' is the base-64 version of our binary, mathematically speaking. However, base64 encoding has to do the encoding in the opposite direction (so the raw data is converted to 'dGVzdA') and also has a rule to tell other applications how much space is left off at the end. This is done by padding the end with '=' symbols. So, the base64 encoding of this data is 'dGVzdA==', with two '=' symbols to signify that two pairs of bits will need to be removed from the end when this data gets decoded to make it match the original data.
Let's test this to see if I am being dishonest:
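One way to check this is to do the left-to-right grouping by hand and compare it with the standard library. The sample b'test' is an assumption inferred from the 'dGVzdA==' result, and the manual steps are a sketch for illustration, not how the base64 module is implemented:

```python
import base64
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
data = b'test'  # assumed sample

# Concatenate all bits, then pad on the right with zeros to a multiple of 6.
bits = "".join(format(byte, "08b") for byte in data)
padded = bits + "0" * (-len(bits) % 6)

# Read 6-bit chunks from the left and map each one to its alphabet character.
chunks = [padded[i:i + 6] for i in range(0, len(padded), 6)]
manual = "".join(ALPHABET[int(chunk, 2)] for chunk in chunks)

# Pad the output with '=' so its length is a multiple of 4.
manual += "=" * (-len(manual) % 4)

print(manual)                  # dGVzdA==
print(base64.b64encode(data))  # b'dGVzdA=='
```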
Why use base64 encoding?
Let's say I have to send some data to someone via email, like this data:
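The sample bytes are not shown here. A plausible reconstruction, assuming the data matches the description that follows (one \x04 byte, the text 'msg', three backspaces, three spaces), might be:

```python
# Assumed reconstruction: EOT, 'msg', three BACKSPACE bytes, three SPACE bytes.
data = b'\x04msg\x08\x08\x08   '

print(data)  # b'\x04msg\x08\x08\x08   '
```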
There are two problems I planted:
1. Processing would stop as soon as the \x04 character was read, because that is ASCII for END-OF-TRANSMISSION (Ctrl-D), so the remaining data would be left out of the transmission.
2. The data also contains BACKSPACE characters and three SPACE characters to erase the 'msg'. Thus, even if I didn't have the END-OF-TRANSMISSION character there, the end user wouldn't be able to translate from the text on screen to the real, raw data.
This is just a demo to show you how hard it can be to simply send raw data. Encoding the data into base64 format gives you the exact same data but in a format that ensures it is safe for sending over electronic media such as email.
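As a rough illustration, the sketch below encodes bytes of the kind described above and confirms that they survive a round trip. The exact byte string is an assumption reconstructed from the description:

```python
import base64
import string

# Assumed sample: EOT, 'msg', three BACKSPACE bytes, three SPACE bytes.
data = b'\x04msg\x08\x08\x08   '

encoded = base64.b64encode(data)
print(encoded)

# Every byte of the encoded form is a printable character from the safe set.
safe = (string.ascii_letters + string.digits + "+/=").encode("ascii")
print(all(byte in safe for byte in encoded))  # True

# Decoding restores the original bytes exactly, control characters included.
print(base64.b64decode(encoded) == data)  # True
```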
base64 encoding takes 8-bit binary byte data and encodes it using only the characters A-Z, a-z, 0-9, +, /* so it can be transmitted over channels that do not preserve all 8 bits of data, such as email.
Hence, it wants a string of 8-bit bytes. You create those in Python 3 with the b'' syntax.
If you remove the b, it becomes a string. A string is a sequence of Unicode characters. base64 has no idea what to do with Unicode data; it's not 8-bit. It's not really any bits, in fact. :-)
In your second example:
>>> encoded = base64.b64encode('data to be encoded')
All the characters fit neatly into the ASCII character set, and base64 encoding is therefore actually a bit pointless. You can convert it to ASCII bytes instead with 'data to be encoded'.encode('ascii'), or more simply write b'data to be encoded', which would be the same thing in this case.
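A quick check that the two spellings produce identical bytes. This is a sketch using the question's sample string, and the variable names are illustrative:

```python
# Encode the str to bytes explicitly with the ASCII codec.
via_encode = 'data to be encoded'.encode('ascii')

# Write the same content directly as a bytes literal.
as_literal = b'data to be encoded'

print(via_encode == as_literal)  # True
```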
* Most base64 flavours may also include a = at the end as padding. In addition, some base64 variants may use characters other than + and /. See the Variants summary table at Wikipedia for an overview.
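For instance, Python's base64 module offers a URL-safe variant alongside the standard one. The input bytes below are an arbitrary choice, picked only because they make both special characters appear:

```python
import base64

raw = b'\xfb\xff'  # arbitrary bytes chosen so that '+' and '/' both show up

print(base64.b64encode(raw))          # b'+/8='
print(base64.urlsafe_b64encode(raw))  # b'-_8='
```

Both results carry a trailing = as padding; only the two special characters differ.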