Another superbly comprehensive answer from: comp.lang.python by Steven D'Aprano (I've tried to format it for stackoverflow):
directly writing a raw UTF-8 encoded string such as 'Jalape\xc3\xb1o'
simply produces a nine-character string U+004A, U+0061, U+006C,
U+0061, U+0070, U+0065, U+00C3, U+00B1, U+006F, which is probably not
what you intended. This is because in UTF-8, the multibyte sequence
\xc3\xb1 is supposed to represent the single character U+00F1, not the
two characters U+00C3 and U+00B1.
This demonstrates confusion of the fundamental concepts, while still
accidentally stumbling across the basic facts right. No wonder it is
confusing you, it confuses me too! :-)
Encoding does not generate a character string, it generates bytes. So the
person you are quoting is causing confusion when he talks about an
"encoded string", he should either make it clear he means a string of
bytes, or not mention the word string at all. Either of these would work:
For older versions of Python (2.5 or older), unfortunately the b''
notation does not work, and you have to leave out the b.
Even better would be if Python did not conflate ASCII characters with
bytes, and forced you to write byte strings like this:
- a UTF-8 encoded byte-string
b'\x4a\x61\x6c\x61\x70\x65\xc3\xb1\x6f'
thus keeping the distinction between ASCII characters and bytes clear.
But that would break backwards compatibility way too much, and so
Python continues to conflate ASCII characters with bytes, even in Python 3.
The important thing here is that bytes b'Jalape\xc3\xb1o' consists of
nine hexadecimal values, as shown above. Seven of them represent the
ASCII characters Jalape and o and two of them are not ASCII. Their
meaning depends on what encoding you are using.
(To be precise, even the meaning of the other seven bytes depends on the
encoding. Fortunately, or unfortunately as the case may be, most but
not all encodings use the same hex values for ASCII characters as ASCII
itself does, so I will stop mentioning this and just pretend that
character J always equals hex byte 4A. But now you know the truth.)
Since we're using the UTF-8 encoding, the two bytes \xc3\xb1 represent
the character ñ, also known as LATIN SMALL LETTER N WITH TILDE. In other
encodings, those two bytes will represent something different.
So, I presume that the original person's intention was to get a Unicode
text string 'Jalapeño'. If they were wise in the ways of Unicode, they
would write one of these:
'Jalape\N{LATIN SMALL LETTER N WITH TILDE}o'
'Jalape\u00F1o'
'Jalape\U000000F1o'
'Jalape\xF1o' # hex
'Jalape\361o' # octal
and be happy. (In Python 2, they would need to prefix all of these with
u, to use Unicode strings instead of byte strings.)
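A quick check, assuming Python 3, that the five spellings above really are the same eight-character text string. The names `target`, `spellings`, `all_equal` and `lengths` are only illustrative:

```python
# The intended text, built from the code point so nothing depends on
# how this source file itself is encoded.
target = 'Jalape' + chr(0xF1) + 'o'

spellings = [
    'Jalape\N{LATIN SMALL LETTER N WITH TILDE}o',  # by character name
    'Jalape\u00F1o',                               # 4-digit escape
    'Jalape\U000000F1o',                           # 8-digit escape
    'Jalape\xF1o',                                 # hex escape
    'Jalape\361o',                                 # octal escape
]

all_equal = all(s == target for s in spellings)
lengths = {len(s) for s in spellings}
print(all_equal, lengths)
```

Every spelling compares equal to the target and has length 8, because each escape names the single code point U+00F1.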
But alas they have been misled by those who propagate myths,
misunderstandings and misapprehensions about Unicode all over the
Internet, and so they looked up ñ somewhere, discovered that it has the
double-byte hex value c3b1 in UTF-8, and thought they could write this:
'Jalape\xc3\xb1o'
This does not do what they think it does. It creates a text string, a
Unicode string, with NINE characters:
J a l a p e Ã ± o
Why? Because character Ã has ordinal value 195, which is c3 in hex, hence
\xc3 is the character Ã; likewise \xb1 is the character ± which has
ordinal value 177 (b1 in hex). And so they have discovered the wickedness
that is mojibake.
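The difference can be seen directly in Python 3. This is only a sketch; the variable names are made up for illustration, and the Latin-1 round trip at the end is a common repair trick rather than anything the original poster used:

```python
# Text string: each \x escape is taken as a code point, not a byte.
wrong = 'Jalape\xc3\xb1o'

# Byte string, explicitly decoded as UTF-8.
right = b'Jalape\xc3\xb1o'.decode('utf-8')

wrong_ordinals = [hex(ord(c)) for c in wrong[6:8]]   # the two stray characters
right_ordinal = hex(ord(right[6]))                   # the single n-with-tilde

# Latin-1 maps code points 0-255 straight onto byte values 0-255, so
# encoding the mojibake as Latin-1 recovers the original UTF-8 bytes.
repaired = wrong.encode('latin-1').decode('utf-8')

print(len(wrong), wrong_ordinals)
print(len(right), right_ordinal)
print(repaired == right)
```

The first string has nine characters (ending in U+00C3, U+00B1, o), the second has eight, and the repaired string equals the second.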
Instead, if they had started with a byte-string, and explicitly decoded
it as UTF-8, they would have been fine:
# I manually encoded 'Jalapeño' to get the bytes below:
bytes = b'Jalape\xc3\xb1o'
print(bytes.decode('utf-8'))
My original question was: Shouldn't this be 8 characters - not 9? He
says: \xc3\xb1 is supposed to represent the single character. However
after some interaction with fellow Pythonistas I'm even more confused.
Depends on the context. \xc3\xb1 could mean the Unicode string
'\xc3\xb1' (in Python 2, written u'\xc3\xb1') or it could mean the
byte-string b'\xc3\xb1' (in Python 2.5 or older, written without the b).
As a string, \xc3\xb1 means two characters, with ordinal values 0xC3 (or
decimal 195) and 0xB1 (or decimal 177), namely 'Ã' and '±'.
As bytes, \xc3\xb1 represent two bytes (well, duh), which could mean
nearly anything:
- the 16-bit Big Endian integer 50097
- the 16-bit Little Endian integer 45507
- a 4x4 black and white bitmap
- the character '簽' (CJK UNIFIED IDEOGRAPH-7C3D) in Big5 encoded bytes
- '뇃' (HANGUL SYLLABLE NWAES) in UTF-16 (Little Endian) encoded bytes
- 'ñ' in UTF-8 encoded bytes
- the two characters 'Ã±' in Latin-1 encoded bytes
- '√±' in MacRoman encoded bytes
- 'Γ±' in ISO-8859-7 encoded bytes
and so forth. Without knowing the context, there is no way of telling
what those two bytes represent, or whether they need to be taken together
as a pair, or as two distinct things.
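To make the point concrete, here is a small Python 3 sketch that reads the same two bytes in several ways. The codec names are the ones in Python's standard `codecs` registry; the variable names are illustrative only:

```python
data = b'\xc3\xb1'

# The same two bytes read as 16-bit integers.
as_big_endian = int.from_bytes(data, 'big')        # 0xC3B1
as_little_endian = int.from_bytes(data, 'little')  # 0xB1C3

# The same two bytes read as text under different encodings.
codecs_to_try = ['utf-8', 'latin-1', 'mac_roman', 'iso8859_7', 'utf-16-le']
decoded = {name: data.decode(name) for name in codecs_to_try}

print(as_big_endian, as_little_endian)
for name, text in decoded.items():
    # ascii() keeps the output readable whatever the terminal encoding is
    print(name, len(text), ascii(text))
```

UTF-8 and UTF-16 each yield a single character, while the three single-byte encodings yield two characters each. Nothing in the bytes themselves says which reading is intended.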
With reference to the above para:
What does he mean by "writing a raw UTF-8 encoded string"??
He means he is confused. You don't get a text string by encoding, you get
bytes (I will accept "byte-string"). The adjective "raw" doesn't really
mean anything in this context. You have bytes that were encoded, or you
have a string containing characters. Raw doesn't really mean anything
except "hey, pay attention, this is low-level stuff" (for some definition
of "low level").
In Python2, one can do 'Jalape funny-n o'.
Nothing funny about it to Spanish speakers.
Personally, I have always considered "o" to be pretty funny. Say "woman"
and "women" aloud -- in the first one, it sounds like "w-oo-man", in the
second it sounds like "w-i-men". Now that's funny. But I digress.
If you type 'Jalapeño' in Python 2 (with or without the b prefix), the
result you get will depend on your terminal settings, but the chances are
high that the terminal will internally represent the string as UTF-8,
which gives you bytes
b'Jalape\xc3\xb1o'
which is nine bytes. When printed, your terminal will try to print each
byte separately, giving:
- byte \x4a prints as J
- byte \x61 prints as a
- byte \x6c prints as l
- ...
and so forth. If you are unlucky your terminal may even be smart enough
to print the two bytes \xc3\xb1 as one character, giving you the ñ you
were hoping for. Why unlucky? Because you got the right result by
accident. Next time you do the same thing, on a different terminal, or
the same terminal set to a different encoding, you will get a completely
different result, and think that Unicode is too messed up to use.
Using Python 2.5, here I print the same string three times in a row,
changing the terminal's encoding each time:
py> print 'Jalape\xc3\xb1o' # terminal set to UTF-8
Jalapeño
py> print 'Jalape\xc3\xb1o' # and ISO-8859-6 (Arabic)
Jalapeأ�o
py> print 'Jalape\xc3\xb1o' # and ISO-8859-5 (Cyrillic)
JalapeУБo
Which one is "right"? Answer: none of them. Not even the first, which by
accident just happened to be what we were hoping for.
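The terminal's role can be imitated in Python 3 by decoding the same nine bytes with each of the three encodings. This is a rough simulation, not what a terminal literally does internally, and the variable names are invented for the example:

```python
raw = b'Jalape\xc3\xb1o'   # the nine bytes that Python 2 sends to the terminal

as_utf8 = raw.decode('utf-8')
as_cyrillic = raw.decode('iso8859_5')

# Byte 0xB1 is unassigned in ISO-8859-6, so a strict decode would fail;
# errors='replace' substitutes U+FFFD, much as the terminal shows a
# placeholder glyph.
as_arabic = raw.decode('iso8859_6', errors='replace')

for label, text in [('UTF-8', as_utf8),
                    ('ISO-8859-6', as_arabic),
                    ('ISO-8859-5', as_cyrillic)]:
    print(label, len(text), ascii(text))
```

Only the UTF-8 reading has eight characters. The other two have nine, which is the same mojibake seen in the session above.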
Really, don't feel bad that you are confused. Between Python 2, and the
terminal trying really hard to do the right thing, it is easy to get
confused because sometimes the right thing happens and sometimes it
doesn't.
This is a 'bytes' string where each glyph is 1 byte long
Nope. It's a string of characters. Glyphs don't come into it. Glyphs are
the little pictures of letters that you see on the screen, or printed on
paper. They could be bitmaps, or fancy vector graphics. They are unlikely
to be one byte each -- more likely 200 bytes per glyph, based on a very
rough calculation [1], but depending on whether it is a bitmap, a
Postscript font, an OpenType font, or something else.
when stored internally so each glyph is
associated with an integer as per charset ASCII or Latin-1. If these
charsets have a funny-n glyph then yay! else nay! There is no UTF-8
here!! or UTF-16!! These are plain bytes (8 bits).
You're getting closer. But you are right: Python 2 "strings" are byte-
strings, which means UTF-8 doesn't come into it. But your terminal might
treat those bytes as UTF-8, and so accidentally do the "right" (wrong)
thing.
Unicode is a really big mapping table between glyphs and integers and
Not glyphs. Between abstract "characters" and integers, called Code
Points. Unicode contains:
- distinct letters, digits, characters
- accented letters
- accents on their own
- symbols, emoticons
- ligatures and variant forms of characters
- chars required only for backwards-compatibility with older encodings
- whitespace
- control characters
- code points reserved for private use, which can mean anything you like
- code points reserved as "will never be used"
- code points explicitly labelled "not a character"
and possibly others I have forgotten.
are denoted as Uxxxx or Uxxxx-xxxx.
The official Unicode notation is:
U+xxxx
U+xxxxx
U+xxxxxx
that is U+ followed by exactly four, five or six hex digits. The U is
always uppercase. Unfortunately Python doesn't support that notation, and
you have to use either four or eight hex digits, e.g.:
\uFFFF
\U0010FFFF
For code points (ordinals) up to 255, you can also use hex or octal
escapes, e.g. \xFF or \377.
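Python has no built-in for the official U+ notation, but it is easy to produce. Below is a minimal sketch in Python 3; `u_plus` is a hypothetical helper name, not a standard function:

```python
def u_plus(ch):
    """Format one character in U+XXXX notation (at least four hex digits)."""
    return f'U+{ord(ch):04X}'

examples = [u_plus('A'), u_plus('\u00F1'), u_plus('\U0010FFFF')]
print(examples)

# For ordinals up to 255 the hex, octal, \u and \U escapes all agree.
same_character = '\xFF' == '\377' == '\u00FF' == '\U000000FF'
print(same_character)
```

The `04X` format pads to four digits and grows to five or six as needed, which matches the official convention.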
UTF-8 UTF-16 are encodings to store
those big integers in an efficient manner.
Almost correct. They're not necessarily efficient.
Unicode code points are just abstract numbers that we give some meaning
to. Code point 65 (U+0041, because hex 41 == decimal 65) means letter A,
and so forth. Imagine these abstract code points floating in your head.
How do you get the abstract concept of a code point into concrete form on
a computer? The same way everything is put in a computer: as bytes, so
we have to turn each abstract code point (a number) into a series of
bytes.
Unicode code points range from U+0000 to U+10FFFF, which means we could
just use exactly three bytes, which take values from 000000 to 10FFFF in
hexadecimal. Values outside of this range, say 110000, would be an error.
For reasons of efficiency, it's faster and better to use four bytes,
even though one of the four will always have the value zero.
In a nutshell, that's the UTF-32 encoding: every character uses exactly
four bytes. E.g. code point U+0041 (character A) is hex bytes 00000041,
or possibly 41000000, depending on whether your computer is Big Endian or
Little Endian.
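Python 3 can show both byte orders explicitly. The `-be` and `-le` codec variants fix the byte order and omit the byte order mark; the variable names are illustrative:

```python
big = 'A'.encode('utf-32-be')      # most significant byte first
little = 'A'.encode('utf-32-le')   # least significant byte first

print(big.hex(), little.hex())

# Every character costs four bytes, whatever its code point.
word = 'Jalape\u00f1o'
utf32_size = len(word.encode('utf-32-be'))
print(len(word), utf32_size)
```

The eight-character word occupies 32 bytes, which is the wastefulness the next paragraph refers to.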
Since most text uses quite low ordinal values, that's awfully wasteful
of memory. So UTF-16 uses just two bytes per character, and a weird
scheme using so-called "surrogate pairs" for everything that won't fit
into two bytes. It works, for some definition of "works", but is
complicated, and you really want to avoid UTF-16 if you need code points
above U+FFFF.
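For the curious, the surrogate-pair arithmetic is short. This Python 3 sketch uses a hypothetical helper, `surrogate_pair`, and checks it against the built-in codec for one code point above U+FFFF (U+1F600 is an arbitrary choice):

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('only code points above U+FFFF need surrogates')
    offset = code_point - 0x10000          # a 20-bit number
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F600)
manual = high.to_bytes(2, 'big') + low.to_bytes(2, 'big')
builtin = chr(0x1F600).encode('utf-16-be')

print(hex(high), hex(low), manual == builtin)
```

Characters up to U+FFFF still take a single two-byte unit. Only the higher code points need the pair.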
UTF-8 uses a neat variable encoding where characters with low ordinal
values get encoded as a single byte (better still: it is the same byte as
ASCII uses, which means old software that assumes everything in the world
is ASCII will keep working, well mostly working). Higher ordinals get
encoded as two, three or four bytes [2]. Best of all, unlike most
historical variable-width encodings, UTF-8 is self-synchronising. In
legacy encodings, if a single byte gets corrupted, it can mangle
everything from that point on. With UTF-8, a single corrupted byte will
mangle only the single code-point containing it, everything following
will be okay.
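Both properties can be observed from Python 3. The sample characters and the choice of which byte to corrupt are arbitrary, and the names are illustrative:

```python
# One sample character for each UTF-8 width: A, n-with-tilde, euro sign, an emoji.
samples = 'A\u00f1\u20ac\U0001F600'
widths = {hex(ord(ch)): len(ch.encode('utf-8')) for ch in samples}
print(widths)

# Corrupt the first byte of the two-byte sequence for n-with-tilde.
good = 'Jalape\u00f1o!'.encode('utf-8')
damaged = good[:6] + b'\xff' + good[7:]

# errors='replace' marks the undecodable bytes with U+FFFD and carries on.
recovered = damaged.decode('utf-8', errors='replace')
print(ascii(recovered))
```

The text before and after the damaged character decodes normally. Only the code point that contained the bad byte is lost.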
So when DB says "writing a
raw UTF-8 encoded string" - well the only way to do this is to use
Python3 where the default string literals are stored in Unicode which
then will use a UTF-8 UTF-16 internally to store the bytes in their
respective structures; or, one could use u'Jalape' which is unicode in
both languages (note the leading u).
Python never uses UTF-8 internally for storing strings in memory. Because
it is a variable width encoding, you cannot index strings efficiently if
they use UTF-8 for storage.
Instead, Python uses one of three different systems:
Up to and including Python 3.2, you have a choice. When you compile the
Python interpreter, you can choose whether it should use UTF-16 or UTF-32
for in-memory storage. This choice is called "narrow" or "wide" build. A
narrow build uses less memory, but cannot handle code points above U+FFFF
very well. A wide build uses more memory, but handles the complete range
of code points perfectly.
Starting in Python 3.3, the choice of how to store the string in memory
is no longer decided up front when you build the Python interpreter.
Instead, Python automatically chooses the most efficient internal
representation for each individual string. Strings which only use ASCII
or Latin-1 characters use one byte per character; strings which use code
points up to U+FFFF use two bytes per character; and only strings which
use code points above that use four bytes per character.
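On CPython 3.3 or later this can be observed with `sys.getsizeof`, as a rough sketch. The exact object sizes are an implementation detail, so the helper below (a made-up name, `bytes_per_char`) measures only how much a string grows when it is doubled in length:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by doubling a string of n copies."""
    # Subtracting the two sizes cancels out the fixed per-object header.
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

per_char = {
    'ASCII': bytes_per_char('a'),
    'Latin-1': bytes_per_char('\u00f1'),
    'up to U+FFFF': bytes_per_char('\u20ac'),
    'above U+FFFF': bytes_per_char('\U0001F600'),
}
print(per_char)
```

This assumes CPython's flexible string representation. Other Python implementations are free to store strings differently.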
So assuming this is Python 3: 'Jalape \xYY \xZZ o' (spaces for
readability) what DB is saying is that, the stupid-user would expect
Jalapeno with a squiggly-n but instead he gets is: Jalape funny1 funny2
o (spaces for readability) -9 glyphs or 9 Unicode-points or 9-UTF8
characters. Correct?
Kind of. See above.
Which leaves me wondering what he means by: "This is because in
UTF-8, the multibyte sequence \xc3\xb1 is supposed to represent the
single character U+00F1, not the two characters U+00C3 and U+00B1"
He means that the single code point U+00F1 (character ñ, n with a tilde)
is stored as the two bytes c3b1 (in hexadecimal) if you encode it using
UTF-8. But if you stuff characters \xc3 \xb1 into a Unicode string
(instead of bytes), then you get two Unicode characters U+00C3 and U+00B1.
To put it another way, inside strings, Python treats the hex escape \xC3
as just a different way of writing the Unicode code point \u00C3 or
\U000000C3.
However, if you create a byte-string:
b'Jalape\xc3\xb1o'
by looking up a table of UTF-8 encodings, as presumably the original
poster did, and then decode those bytes to a string, you will get what
you expect. Using Python 2.5, where the b prefix is not needed:
py> tasty = 'Jalape\xc3\xb1o' # actually bytes
py> tasty.decode('utf-8')
u'Jalape\xf1o'
py> print tasty.decode('utf-8') # oops I forgot to reset my terminal
JalapeУБo
py> print tasty.decode('utf-8') # terminal now set to UTF-8
Jalapeño
[1] Assume the font file is 50K in size, and it has glyphs for 256
characters. That works out to 195 bytes per glyph.
[2] Technically, the UTF-8 scheme can handle 31-bit code points, up to
the (hypothetical) code point U+7FFFFFFF, using up to six bytes per code
point. But Unicode officially will never go past U+10FFFF, and so UTF-8
also will never go past four bytes per code point.
https://groups.google.com/forum/#!topic/comp.lang.python/1boxbYjhClg
Joshua Landau (answering my question) wrote:
"directly writing a raw UTF-8 encoded string such as 'Jalape\xc3\xb1o' simply produces a nine-character string U+004A, U+0061, U+006C, U+0061, U+0070, U+0065, U+00C3, U+00B1, U+006F, which is probably not what you intended. This is because in UTF-8, the multibyte sequence \xc3\xb1 is supposed to represent the single character U+00F1, not the two characters U+00C3 and U+00B1."
Correct.
My original question was: Shouldn't this be 8 characters - not 9?
No, Python tends to be right on these things.
He says: \xc3\xb1 is supposed to represent the single character. However after some interaction with fellow Pythonistas i'm even more confused.
You would be, given the way he said it.
With reference to the above para:
1. What does he mean by "writing a raw UTF-8 encoded string"??
Well, that doesn't really mean much with no context like he gave it.
In Python2, once can do 'Jalape funny-n o'. This is a 'bytes' string where each glyph is 1 byte long when stored internally so each glyph is associated with an integer as per charset ASCII or Latin-1. If these charsets have a funny-n glyph then yay! else nay! There is no UTF-8 here!! or UTF-16!! These are plain bytes (8 bits).
Unicode is a really big mapping table between glyphs and integers and are denoted as Uxxxx or Uxxxx-xxxx.
Waits for our resident unicode experts to explain why you're actually wrong
UTF-8 UTF-16 are encodings to store those big integers in an efficient manner. So when DB says "writing a raw UTF-8 encoded string" - well the only way to do this is to use Python3 where the default string literals are stored in Unicode which then will use a UTF-8 UTF-16 internally to store the bytes in their respective structures; or, one could use u'Jalape' which is unicode in both languages (note the leading 'u').
Correct.
- So assuming this is Python 3: 'Jalape \xYY \xZZ o' (spaces for readability) what DB is saying is that, the stupid-user would expect Jalapeno with a squiggly-n but instead he gets is: Jalape funny1 funny2 o (spaces for readability) -9 glyphs or 9 Unicode-points or 9-UTF8 characters. Correct?
I think so.
- Which leaves me wondering what he means by:
"This is because in UTF-8, the multibyte sequence \xc3\xb1 is supposed to represent the single character U+00F1, not the two characters U+00C3 and U+00B1"
He's mixed some things up, AFAICT.
Could someone take the time to read carefully and clarify what DB is saying??
Here's a simple explanation: you're both wrong (or you're both almost right):
As of Python 3:
>>> "\xc3\xb1"
'Ã±'
>>> b"\xc3\xb1".decode()
'ñ'
"WHAT?!" you scream, "THAT'S WRONG!" But it's not. Let me explain.
Python 3's strings want you to give each character separately (*winces
in case I'm wrong*). Python is interpreting the "\xc3" as
"\N{LATIN CAPITAL LETTER A WITH TILDE}" and "\xb1" as
"\N{PLUS-MINUS SIGN}"¹.
This means that Python is given two characters. Python is basically
doing this:
number = int("c3", 16) # Convert from base16
chr(number) # Turn to the character from the Unicode mapping
When you give Python raw bytes, you are saying that this is what the
string looks like when encoded -- you are not giving Python Unicode,
but encoded Unicode. This means that when you decode it (.decode())
it is free to convert multibyte sections to their relevant characters.
To see how an encoded string is not the same as the string itself, see:
>>> "Jalepeño".encode("ASCII", errors="xmlcharrefreplace")
b'Jalepe&#241;o'
Those represent the same thing, but the first (according to Python)
is the thing, the second needs to be decoded.
Now, bringing this back to the original:
>>> "\xc3\xb1".encode()
b'\xc3\x83\xc2\xb1'
You can see that the encoded bytes represent the two characters;
the string you see above is not the encoded one. The encoding is
internal to Python.
I hope that helps; good luck.
¹ Note that I find the "\N{...}" form much easier to read, and recommend it.