Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode codes in unicode strings using the escape char \?
Executing a module with:
# -*- coding: utf-8 -*-
a = 'á'
ua = u'á'
print a, ua
Results in: á, á
EDIT:
More testing using Python shell:
>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'
So, the unicode string seems to be encoded using latin1 instead of utf-8, and the raw string is encoded using utf-8? I'm even more confused now! :S
unicode is meant to handle text. Text is a sequence of code points which may be bigger than a single byte. Text can be encoded in a specific encoding to represent the text as raw bytes (e.g. utf-8, latin-1...).
Note that unicode is not encoded! The internal representation used by Python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.
On the contrary, str in Python 2 is a plain sequence of bytes. It does not represent text!
You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.
Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.
Some differences that you can see:
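A minimal sketch of one such difference, the length of the same text before and after encoding. The variable names are illustrative, and explicit u'' and b'' style literals are used so the snippet behaves the same on Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
# Sketch only: variable names are illustrative, not from the original answer.
text = u'\xe0'                         # one code point, U+00E0 (à)
utf8_bytes = text.encode('utf-8')      # the same text encoded as UTF-8
latin1_bytes = text.encode('latin-1')  # the same text encoded as Latin-1

print(len(text))          # counts code points
print(len(utf8_bytes))    # counts bytes of the UTF-8 encoding
print(len(latin1_bytes))  # counts bytes of the Latin-1 encoding
```

The text has length 1, while its UTF-8 and Latin-1 encodings take 2 bytes and 1 byte respectively.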
Note that using str you have lower-level control over the single bytes of a specific encoding representation, while using unicode you can only control things at the code-point level. For example, with str you can alter or remove an individual byte in the middle of a multi-byte sequence, and what before was valid UTF-8 isn't anymore. Using a unicode string you cannot operate in such a way that the resulting string isn't valid unicode text. You can remove a code point, replace a code point with a different code point, etc., but you cannot mess with the internal representation.
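The byte-level versus code-point-level distinction can be sketched roughly as follows. This is an illustration rather than the original answer's example: the sample text and variable names are assumptions, and b'' literals are used so it runs on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
# Sketch only: sample text and names are illustrative.
text = u'\xe0\xe8\xec'            # three code points: à è ì
data = text.encode('utf-8')       # six bytes: c3 a0 c3 a8 c3 ac

# Byte level: drop only the second byte of the encoded è.
broken = data.replace(b'\xa8', b'')
try:
    broken.decode('utf-8')
    still_valid = True
except UnicodeDecodeError:
    still_valid = False           # the remaining bytes are not valid UTF-8

# Code-point level: remove the whole character è instead.
shorter = text.replace(u'\xe8', u'')
reencoded = shorter.encode('utf-8')  # still encodes cleanly

print(still_valid)
print(len(shorter))
```

Working on the bytes can leave a dangling lead byte behind, whereas working on code points always removes or replaces whole characters.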
Your terminal happens to be configured to UTF-8.
The fact that printing a works is a coincidence; you are writing raw UTF-8 bytes to the terminal. a is a value of length two, containing two bytes, hex values C3 and A1, while ua is a unicode value of length one, containing a codepoint U+00E1.
This difference in length is one major reason to use Unicode values; you cannot easily measure the number of text characters in a byte string; the len() of a byte string tells you how many bytes were used, not how many characters were encoded.
You can see the difference when you encode the unicode value to different output encodings:
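As a rough illustration (the choice of encodings and the dictionary name are assumptions made for this sketch), the same one-character unicode value produces different bytes under different encodings, and each of them decodes back to the same text:

```python
# -*- coding: utf-8 -*-
# Sketch only: the set of encodings and the names used are illustrative.
ua = u'\xe1'  # U+00E1 (á)

encodings = ('utf-8', 'latin-1', 'utf-16-le')
encoded = {name: ua.encode(name) for name in encodings}

for name in encodings:
    # The byte count depends on the encoding, not on the text itself.
    print('%s: %d byte(s)' % (name, len(encoded[name])))

# Decoding with the matching encoding recovers the original text.
round_trips = all(encoded[name].decode(name) == ua for name in encodings)
print(round_trips)
```

Only the bytes differ between encodings; the unicode value itself stays one code point long.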
Note that the first 256 codepoints of the Unicode standard match the Latin 1 standard, so the U+00E1 codepoint is encoded to Latin 1 as a byte with hex value E1.
Furthermore, Python uses escape codes in representations of unicode and byte strings alike, and low code points that are not printable ASCII are represented using \x.. escape values as well. This is why a Unicode string with a code point between 128 and 255 looks just like the Latin 1 encoding. If you have a unicode string with codepoints beyond U+00FF, a different escape sequence, \u...., is used instead, with a four-digit hex value.
It looks like you don't yet fully understand what the difference is between Unicode and an encoding. Please do read the following articles before you continue:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
Unicode and encodings are completely different, unrelated things.
Unicode
Assigns a numeric ID to each character:
So, Unicode assigns the number 0x41 to A, 0xE1 to á, and 0x414 to Д.
Even the little arrow → I used has its Unicode number, it's 0x2192. And even emojis have their Unicode numbers,
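The numbers mentioned above can be checked with the built-in ord(). This is a small sketch; the list name is illustrative and the characters are written as escapes so the snippet does not depend on the source file's encoding:

```python
# -*- coding: utf-8 -*-
# Sketch only: characters are A, á, Д and → written as escapes.
chars = [u'A', u'\xe1', u'\u0414', u'\u2192']

# ord() returns the Unicode number (code point) of a one-character string.
code_points = [ord(ch) for ch in chars]

for cp in code_points:
    print(hex(cp))
```

These numbers are the same whichever encoding is later used to turn the text into bytes.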
When you define the string as unicode, á counts as a single char, just like a plain a. Otherwise, as a str, á counts as two chars. Try len(a) and len(ua). In addition to that, you need to know the encoding when you work with other environments. For example, md5 works on bytes, so the value you get for the same text depends on the encoding used to produce those bytes.
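A brief sketch of the md5 point, using the standard hashlib module (the variable names are illustrative): the same text hashed after encoding with two different encodings gives two different digests.

```python
# -*- coding: utf-8 -*-
# Sketch only: names are illustrative.
import hashlib

ua = u'\xe1'  # U+00E1 (á)

# md5 operates on bytes, so the text has to be encoded first.
digest_utf8 = hashlib.md5(ua.encode('utf-8')).hexdigest()
digest_latin1 = hashlib.md5(ua.encode('latin-1')).hexdigest()

print(digest_utf8 != digest_latin1)
```

This is why both sides of an exchange need to agree on the encoding before comparing hashes of text.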