According to page 29 of Python Essential Reference (4th Edition) by David Beazley:
> directly writing a raw UTF-8 encoded string such as `'Jalape\xc3\xb1o'` simply produces a nine-character string U+004A, U+0061, U+006C, U+0061, U+0070, U+0065, U+00C3, U+00B1, U+006F, which is probably not what you intended. This is because in UTF-8, the multibyte sequence `\xc3\xb1` is supposed to represent the single character U+00F1, not the two characters U+00C3 and U+00B1.
Shouldn't this be 8 characters, not 9? He says `\xc3\xb1` is supposed to represent the single character.
https://groups.google.com/forum/#!topic/comp.lang.python/1boxbYjhClg
Joshua Landau (answering my question) wrote:
Correct.
No, Python tends to be right on these things.
You would be, given the way he said it.
Well, that doesn't really mean much with no context like he gave it.
*Waits for our resident unicode experts to explain why you're actually wrong*
Correct.
I think so.
He's mixed some things up, AFAICT.
Here's a simple explanation: you're both wrong (or you're both almost right):
As of Python 3:
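```python
>>> 'Jalape\xc3\xb1o'    # Python 3
'JalapeÃ±o'
>>> len('Jalape\xc3\xb1o')
9
```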
"WHAT?!" you scream, "THAT'S WRONG!" But it's not. Let me explain.
Python 3's strings want you to give each character separately (*winces in case I'm wrong*). Python is interpreting the `"\xc3"` as `"\N{LATIN CAPITAL LETTER A WITH TILDE}"` and `"\xb1"` as `"\N{PLUS-MINUS SIGN}"`¹. This means that Python is given two characters, and is basically doing what the session below shows.

When you give Python raw bytes, you are saying that this is what the string looks like when encoded -- you are not giving Python Unicode, but encoded Unicode. This means that when you decode it (`.decode()`) it is free to convert multibyte sections to their relevant characters.
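Both points, in a Python 3 session:

```python
>>> "\xc3" == "\N{LATIN CAPITAL LETTER A WITH TILDE}"
True
>>> "\xb1" == "\N{PLUS-MINUS SIGN}"
True
>>> b'Jalape\xc3\xb1o'.decode('utf-8')  # decoding maps the byte pair to one character
'Jalapeño'
```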
To see how an encoded string is not the same as the string itself, see:
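For example, with the string from the question:

```python
>>> 'Jalapeño'                  # the string itself
'Jalapeño'
>>> 'Jalapeño'.encode('utf-8')  # the string, encoded to bytes
b'Jalape\xc3\xb1o'
```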
Those represent the same thing, but the first (according to Python) is the thing, the second needs to be decoded.
Now, bringing this back to the original:
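```python
>>> 'Jalape\xc3\xb1o'                  # two characters, Ã and ±, not one
'JalapeÃ±o'
>>> 'Jalape\xc3\xb1o'.encode('utf-8')  # each of those two encodes to two bytes
b'Jalape\xc3\x83\xc2\xb1o'
```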
You can see that the encoded bytes represent the two characters; the string you see above is not the encoded one. The encoding is internal to Python.
I hope that helps; good luck.
¹ Note that I find the `"\N{...}"` form much easier to read, and recommend it.

---

No, the statement is correct.
In UTF-8, `\xc3\xb1` is supposed to represent a single character. That is, if you decoded the string from UTF-8, you'd get a single character, and therefore 8 characters in total. However, in this particular example the string is treated as a raw sequence of characters, not UTF-8, so the two octets result in two characters.
I may be getting a bit ahead, but see the following output of IPython (Python 3):
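```python
In [1]: s = 'Jalape\xc3\xb1o'    # character string: \xc3 and \xb1 are two characters

In [2]: len(s)
Out[2]: 9

In [3]: b = b'Jalape\xc3\xb1o'   # byte string: \xc3 and \xb1 are two bytes

In [4]: len(b.decode('utf-8'))   # decoding turns the byte pair into one character
Out[4]: 8
```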
The code above is for Python 3. For Python 2, byte strings (`b'Jalape\xc3\xb1o'`) would be replaced with regular strings (`'Jalape\xc3\xb1o'`), and regular strings would be replaced with unicode strings (`u'Jalape\xf1o'`).

---

Another superbly comprehensive answer, from comp.lang.python, by Steven D'Aprano (I've tried to format it for Stack Overflow):
This demonstrates confusion about the fundamental concepts, while still accidentally getting the basic facts right. No wonder it is confusing you; it confuses me too! :-)
Encoding does not generate a character string, it generates bytes. So the person you are quoting is causing confusion when he talks about an "encoded string"; he should either make it clear he means a string of bytes, or not mention the word string at all. Either of these would work:

- a UTF-8 encoded byte-string: `b'Jalape\xc3\xb1o'`
- UTF-8 encoded bytes: `b'Jalape\xc3\xb1o'`
For older versions of Python (2.5 or older), unfortunately the `b''` notation does not work, and you have to leave out the `b`.

Even better would be if Python did not conflate ASCII characters with bytes, and forced you to write byte strings like this:

```python
b'\x4a\x61\x6c\x61\x70\x65\xc3\xb1\x6f'
```

thus keeping the distinction between ASCII characters and bytes clear. But that would break backwards compatibility way too much, and so Python continues to conflate ASCII characters with bytes, even in Python 3.
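Python treats the two spellings as the same bytes:

```python
>>> b'Jalape\xc3\xb1o' == b'\x4a\x61\x6c\x61\x70\x65\xc3\xb1\x6f'
True
```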
The important thing here is that the bytes `b'Jalape\xc3\xb1o'` consist of nine hexadecimal values, as shown above. Seven of them represent the ASCII characters `Jalape` and `o`, and two of them are not ASCII. Their meaning depends on what encoding you are using.

(To be precise, even the meaning of the other seven bytes depends on the encoding. Fortunately, or unfortunately as the case may be, most but not all encodings use the same hex values for ASCII characters as ASCII itself does, so I will stop mentioning this and just pretend that character `J` always equals hex byte `4A`. But now you know the truth.)

Since we're using the UTF-8 encoding, the two bytes `\xc3\xb1` represent the character `ñ`, also known as LATIN SMALL LETTER N WITH TILDE. In other encodings, those two bytes will represent something different.

So, I presume that the original person's intention was to get a Unicode text string `'Jalapeño'`. If they were wise in the ways of Unicode, they would write one of these:

```python
'Jalape\N{LATIN SMALL LETTER N WITH TILDE}o'
'Jalape\u00F1o'
'Jalape\U000000F1o'
'Jalape\xF1o'  # hex
'Jalape\361o'  # octal
```

and be happy.
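All five spellings produce the same eight-character string:

```python
>>> ('Jalape\N{LATIN SMALL LETTER N WITH TILDE}o' == 'Jalape\u00F1o'
...  == 'Jalape\U000000F1o' == 'Jalape\xF1o' == 'Jalape\361o')
True
>>> len('Jalape\xF1o')
8
```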
(In Python 2, they would need to prefix all of these with `u`, to use Unicode strings instead of byte strings.)

But alas, they have been misled by those who propagate myths, misunderstandings and misapprehensions about Unicode all over the Internet, and so they looked up `ñ` somewhere, discovered that it has the double-byte hex value `c3b1` in UTF-8, and thought they could write `'Jalape\xc3\xb1o'`. This does not do what they think it does. It creates a text string, a Unicode string, with NINE characters:
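```python
>>> s = 'Jalape\xc3\xb1o'    # Python 3
>>> s
'JalapeÃ±o'
>>> len(s)
9
```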
Why? Because character `Ã` has ordinal value 195, which is `c3` in hex, hence `\xc3` is the character `Ã`; likewise `\xb1` is the character `±`, which has ordinal value 177 (`b1` in hex). And so they have discovered the wickedness that is mojibake.

Instead, if they had started with a byte-string, and explicitly decoded it as UTF-8, they would have been fine:
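```python
>>> b'Jalape\xc3\xb1o'.decode('utf-8')   # Python 3 syntax; in 2.5 drop the b prefix
'Jalapeño'
>>> len(b'Jalape\xc3\xb1o'.decode('utf-8'))
8
```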
Depends on the context. `\xc3\xb1` could mean the Unicode string `'\xc3\xb1'` (in Python 2, written `u'\xc3\xb1'`), or it could mean the byte-string `b'\xc3\xb1'` (in Python 2.5 or older, written without the `b`).

As a string, `\xc3\xb1` means two characters, with ordinal values `0xC3` (or decimal 195) and `0xB1` (or decimal 177), namely `'Ã'` and `'±'`.

As bytes, `\xc3\xb1` represents two bytes (well, duh), which could mean nearly anything:

- the 16-bit Big Endian integer 50097
- the 16-bit Little Endian integer 45507
- a 4x4 black and white bitmap
- the character `'簽'` (CJK UNIFIED IDEOGRAPH-7C3D) in Big5 encoded bytes
- `'뇃'` (HANGUL SYLLABLE NWAES) in UTF-16 (Big Endian) encoded bytes
- `'ñ'` in UTF-8 encoded bytes
- the two characters `'Ã±'` in Latin-1 encoded bytes
- `'√±'` in MacRoman encoded bytes
- `'Γ±'` in ISO-8859-7 encoded bytes

and so forth. Without knowing the context, there is no way of telling what those two bytes represent, or whether they need to be taken together as a pair, or as two distinct things.
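You can watch those two bytes change meaning just by choosing a different codec (Python 3 syntax):

```python
>>> for enc in ['big5', 'utf-16-be', 'utf-8', 'latin-1', 'mac-roman', 'iso8859-7']:
...     print(enc, b'\xc3\xb1'.decode(enc))
...
big5 簽
utf-16-be 뇃
utf-8 ñ
latin-1 Ã±
mac-roman √±
iso8859-7 Γ±
```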
He means he is confused. You don't get a text string by encoding, you get bytes (I will accept "byte-string"). The adjective "raw" doesn't really mean anything in this context. You have bytes that were encoded, or you have a string containing characters. Raw doesn't really mean anything except "hey, pay attention, this is low-level stuff" (for some definition of "low level").
Nothing funny about it to Spanish speakers.
Personally, I have always considered "o" to be pretty funny. Say "woman" and "women" aloud -- in the first one, it sounds like "w-oo-man", in the second it sounds like "w-i-men". Now that's funny. But I digress.
If you type `'Jalapeño'` in Python 2 (with or without the `b` prefix), the result you get will depend on your terminal settings, but the chances are high that the terminal will internally represent the string as UTF-8, which gives you the bytes `4A 61 6C 61 70 65 C3 B1 6F`, which is nine bytes. When printed, your terminal will try to print each byte separately, giving:

- `\x4a` prints as `J`
- `\x61` prints as `a`
- `\x6c` prints as `l`
and so forth. If you are unlucky, your terminal may even be smart enough to print the two bytes `\xc3\xb1` as one character, giving you the `ñ` you were hoping for. Why unlucky? Because you got the right result by accident. Next time you do the same thing, on a different terminal, or the same terminal set to a different encoding, you will get a completely different result, and think that Unicode is too messed up to use.

Using Python 2.5, here I print the same string three times in a row, changing the terminal's encoding each time:
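The exact encodings he used aren't recorded here, but with, say, UTF-8, Latin-1 and MacRoman the effect looks like this:

```python
>>> print 'Jalape\xc3\xb1o'   # terminal encoding: UTF-8
Jalapeño
>>> print 'Jalape\xc3\xb1o'   # terminal encoding: Latin-1 (assumed for illustration)
JalapeÃ±o
>>> print 'Jalape\xc3\xb1o'   # terminal encoding: MacRoman (assumed for illustration)
Jalape√±o
```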
Which one is "right"? Answer: none of them. Not even the first, which by accident just happened to be what we were hoping for.
Really, don't feel bad that you are confused. Between Python 2, and the terminal trying really hard to do the right thing, it is easy to get confused, because sometimes the right thing happens and sometimes it doesn't.
Nope. It's a string of characters. Glyphs don't come into it. Glyphs are the little pictures of letters that you see on the screen, or printed on paper. They could be bitmaps, or fancy vector graphics. They are unlikely to be one byte each -- more likely 200 bytes per glyph, based on a very rough calculation¹, but depending on whether it is a bitmap, a Postscript font, an OpenType font, or something else.
You're getting closer. But you are right: Python 2 "strings" are byte-strings, which means UTF-8 doesn't come into it. But your terminal might treat those bytes as UTF-8, and so accidentally do the "right" (wrong) thing.
Not glyphs. Between abstract "characters" and integers, called Code Points. Unicode contains:

- letters
- digits
- punctuation marks
- symbols
- accents and other combining characters
- control characters

and possibly others I have forgotten.
The official Unicode notation is `U+` followed by exactly four, five or six hex digits, e.g. `U+0041`. The `U` is always uppercase. Unfortunately Python doesn't support that notation, and you have to use either four or eight hex digits, e.g. `'\u0041'` or `'\U00000041'`. For code points (ordinals) up to 255, you can also use hex or octal escapes, e.g. `'\xFF'` or `'\377'`.
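All of these escapes name the same code point; for example:

```python
>>> '\u0041' == '\U00000041' == '\x41' == 'A'
True
>>> '\xFF' == '\377'   # hex and octal escapes for code point 255
True
```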
Almost correct. They're not necessarily efficient.
Unicode code points are just abstract numbers that we give some meaning to. Code point 65 (`U+0041`, because hex 41 == decimal 65) means letter `A`, and so forth. Imagine these abstract code points floating in your head. How do you get the abstract concept of a code point into concrete form on a computer? The same way everything is put in a computer: as bytes, so we have to turn each abstract code point (a number) into a series of bytes.

Unicode code points range from `U+0000` to `U+10FFFF`, which means we could just use exactly three bytes, which take values from 000000 to 10FFFF in hexadecimal. Values outside of this range, say 110000, would be an error. For reasons of efficiency, it's faster and better to use four bytes, even though one of the four will always have the value zero.

In a nutshell, that's the UTF-32 encoding: every character uses exactly four bytes. E.g. code point `U+0041` (character `A`) is hex bytes `00000041`, or possibly `41000000`, depending on whether your computer is Big Endian or Little Endian.

Since most text uses quite low ordinal values, that's awfully wasteful of memory. So UTF-16 uses just two bytes per character, and a weird scheme using so-called "surrogate pairs" for everything that won't fit into two bytes. It works, for some definition of "works", but is complicated, and you really want to avoid UTF-16 if you need code points above `U+FFFF`.

UTF-8 uses a neat variable encoding where characters with low ordinal values get encoded as a single byte (better still: it is the same byte as ASCII uses, which means old software that assumes everything in the world is ASCII will keep working, well, mostly working). Higher ordinals get encoded as two, three or four bytes². Best of all, unlike most historical variable-width encodings, UTF-8 is self-synchronising. In legacy encodings, if a single byte gets corrupted, it can mangle everything from that point on. With UTF-8, a single corrupted byte will mangle only the single code-point containing it; everything following will be okay.
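A quick way to compare the three encodings, in a Python 3 session:

```python
>>> 'ñ'.encode('utf-32-be')   # always four bytes per character
b'\x00\x00\x00\xf1'
>>> 'ñ'.encode('utf-16-be')   # two bytes (plus surrogate pairs above U+FFFF)
b'\x00\xf1'
>>> 'ñ'.encode('utf-8')       # one to four bytes, depending on the code point
b'\xc3\xb1'
>>> 'A'.encode('utf-8')       # ASCII characters keep their ASCII byte values
b'A'
```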
Python never uses UTF-8 internally for storing strings in memory. Because it is a variable width encoding, you cannot index strings efficiently if they use UTF-8 for storage.
Instead, Python uses one of three different systems:
Up to Python 3.3, you have a choice. When you compile the Python interpreter, you can choose whether it should use UTF-16 or UTF-32 for in-memory storage. This choice is called a "narrow" or "wide" build. A narrow build uses less memory, but cannot handle code points above `U+FFFF` very well. A wide build uses more memory, but handles the complete range of code points perfectly.

Starting in Python 3.3, the choice of how to store the string in memory is no longer decided up front when you build the Python interpreter. Instead, Python automatically chooses the most efficient internal representation for each individual string. Strings which only use ASCII or Latin-1 characters use one byte per character; strings which use code points up to `U+FFFF` use two bytes per character; and only strings which use code points above that use four bytes per character.
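You can see this flexible representation indirectly with `sys.getsizeof` (a rough sketch; exact byte counts vary between CPython versions, but the per-character growth is the point):

```python
import sys

# One thousand copies of a single character; storage per character
# depends on the widest code point in the string (CPython 3.3+).
for ch in ['a', '\xf1', '\u20ac', '\U0001f600']:
    print(hex(ord(ch)), sys.getsizeof(ch * 1000))
```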
Kind of. See above.

He means that the single code point `U+00F1` (character `ñ`, n with a tilde) is stored as the two bytes `c3b1` (in hexadecimal) if you encode it using UTF-8. But if you stuff the characters `\xc3` and `\xb1` into a Unicode string (instead of bytes), then you get the two Unicode characters `U+00C3` and `U+00B1`.

To put it another way, inside strings, Python treats the hex escape `\xC3` as just a different way of writing the Unicode code point `\u00C3` or `\U000000C3`.
However, if you create a byte-string `b'Jalape\xc3\xb1o'` by looking up a table of UTF-8 encodings, as presumably the original poster did, and then decode those bytes to a string, you will get what you expect. Using Python 2.5, where the `b` prefix is not needed, the session would look something like this:
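```python
>>> s = 'Jalape\xc3\xb1o'     # Python 2: a byte-string, no b prefix needed
>>> s.decode('utf-8')
u'Jalape\xf1o'
>>> print s.decode('utf-8')   # assuming the terminal can display it
Jalapeño
```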
¹ Assume the font file is 50K in size, and it has glyphs for 256 characters. That works out to 195 bytes per glyph.
² Technically, the UTF-8 scheme can handle 31-bit code points, up to the (hypothetical) code point U+7FFFFFFF, using up to six bytes per code point. But Unicode officially will never go past U+10FFFF, and so UTF-8 also will never go past four bytes per code point.