Does the UTF8-encoded string 'Jalape\\xc3\\xb1

Posted 2019-03-22 04:45


女痞


Question:

According to page 29 of Python Essential Reference (4th Edition) by David Beazley:

directly writing a raw UTF-8 encoded string such as 'Jalape\xc3\xb1o' simply produces a nine-character string U+004A, U+0061, U+006C, U+0061, U+0070, U+0065, U+00C3, U+00B1, U+006F, which is probably not what you intended. This is because in UTF-8, the multibyte sequence \xc3\xb1 is supposed to represent the single character U+00F1, not the two characters U+00C3 and U+00B1.

Shouldn't this be 8 characters - not 9? He says: \xc3\xb1 is supposed to represent the single character.

Answer 1:

Another superbly comprehensive answer from: comp.lang.python by Steven D'Aprano (I've tried to format it for stackoverflow):

directly writing a raw UTF-8 encoded string such as 'Jalape\xc3\xb1o' simply produces a nine-character string U+004A, U+0061, U+006C, U+0061, U+0070, U+0065, U+00C3, U+00B1, U+006F, which is probably not what you intended. This is because in UTF-8, the multibyte sequence \xc3\xb1 is supposed to represent the single character U+00F1, not the two characters U+00C3 and U+00B1.

This demonstrates confusion about the fundamental concepts, while still accidentally getting the basic facts right. No wonder it is confusing you, it confuses me too! :-)

Encoding does not generate a character string, it generates bytes. So the person you are quoting is causing confusion when he talks about an "encoded string", he should either make it clear he means a string of bytes, or not mention the word string at all. Either of these would work:

a UTF-8 encoded byte-string b'Jalape\xc3\xb1o'
UTF-8 encoded bytes b'Jalape\xc3\xb1o'

For older versions of Python (2.5 or older), unfortunately the b'' notation does not work, and you have to leave out the b.

Even better would be if Python did not conflate ASCII characters with bytes, and forced you to write byte strings like this:

a UTF-8 encoded byte-string b'\x4a\x61\x6c\x61\x70\x65\xc3\xb1\x6f'

thus keeping the distinction between ASCII characters and bytes clear. But that would break backwards compatibility way too much, and so Python continues to conflate ASCII characters with bytes, even in Python 3.

The important thing here is that bytes b'Jalape\xc3\xb1o' consists of nine hexadecimal values, as shown above. Seven of them represent the ASCII characters Jalape and o and two of them are not ASCII. Their meaning depends on what encoding you are using.

(To be precise, even the meaning of the other seven bytes depends on the encoding. Fortunately, or unfortunately as the case may be, most but not all encodings use the same hex values for ASCII characters as ASCII itself does, so I will stop mentioning this and just pretend that character J always equals hex byte 4A. But now you know the truth.)

Since we're using the UTF-8 encoding, the two bytes \xc3\xb1 represent the character ñ, also known as LATIN SMALL LETTER N WITH TILDE. In other encodings, those two bytes will represent something different.

So, I presume that the original person's intention was to get a Unicode text string 'Jalapeño'. If they were wise in the ways of Unicode, they would write one of these:

'Jalape\N{LATIN SMALL LETTER N WITH TILDE}o'
'Jalape\u00F1o'
'Jalape\U000000F1o'
'Jalape\xF1o' # hex
'Jalape\361o' # octal

and be happy. (In Python 2, they would need to prefix all of these with u, to use Unicode strings instead of byte strings.)
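As a quick check, assuming Python 3, the following sketch confirms that the five spellings listed above all produce the same eight-character text string. The list name `spellings` is just an illustrative choice.

```python
# Five ways to write the same Unicode text string in Python 3.
spellings = [
    'Jalape\N{LATIN SMALL LETTER N WITH TILDE}o',
    'Jalape\u00F1o',
    'Jalape\U000000F1o',
    'Jalape\xF1o',   # hex escape
    'Jalape\361o',   # octal escape
]

# All five literals compare equal, so the set collapses to one element.
print(len(set(spellings)))   # 1
print(len(spellings[0]))     # 8
```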

But alas they have been misled by those who propagate myths, misunderstandings and misapprehensions about Unicode all over the Internet, and so they looked up ñ somewhere, discovered that it has the double-byte hex value c3b1 in UTF-8, and thought they could write this:

'Jalape\xc3\xb1o'

This does not do what they think it does. It creates a text string, a Unicode string, with NINE characters:

J a l a p e Ã ± o

Why? Because character Ã has ordinal value 195, which is c3 in hex, hence \xc3 is the character Ã; likewise \xb1 is the character ± which has ordinal value 177 (b1 in hex). And so they have discovered the wickedness that is mojibake.
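A minimal sketch, assuming Python 3, that makes the nine characters visible. It uses the standard `unicodedata` module to name the two characters produced by the \xc3 and \xb1 escapes.

```python
import unicodedata

text = 'Jalape\xc3\xb1o'  # a text string, NOT bytes

print(len(text))  # 9

# Inside a str literal, each \xNN escape is one code point, not one byte.
for ch in text[6:8]:
    print(f'U+{ord(ch):04X}', unicodedata.name(ch))
# U+00C3 LATIN CAPITAL LETTER A WITH TILDE
# U+00B1 PLUS-MINUS SIGN
```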

Instead, if they had started with a byte-string, and explicitly decoded it as UTF-8, they would have been fine:

# I manually encoded 'Jalapeño' to get the bytes below.
# (The name "data" avoids shadowing the built-in "bytes" type.)
data = b'Jalape\xc3\xb1o'
print(data.decode('utf-8'))

My original question was: Shouldn't this be 8 characters - not 9? He says: \xc3\xb1 is supposed to represent the single character. However after some interaction with fellow Pythonistas I'm even more confused.

Depends on the context. \xc3\xb1 could mean the Unicode string '\xc3\xb1' (in Python 2, written u'\xc3\xb1') or it could mean the byte-string b'\xc3\xb1' (in Python 2.5 or older, written without the b).

As a string, \xc3\xb1 means two characters, with ordinal values 0xC3 (or decimal 195) and 0xB1 (or decimal 177), namely 'Ã' and '±'.

As bytes, \xc3\xb1 represent two bytes (well, duh), which could mean nearly anything:

the 16-bit Big Endian integer 50097
the 16-bit Little Endian integer 45507
a 4x4 black and white bitmap
the character '簽' (CJK UNIFIED IDEOGRAPH-7C3D) in Big5 encoded bytes
'뇃' (HANGUL SYLLABLE NWAES) in UTF-16 (Little Endian) encoded bytes
'ñ' in UTF-8 encoded bytes
the two characters 'Ã±' in Latin-1 encoded bytes
'√±' in MacRoman encoded bytes
'Γ±' in ISO-8859-7 encoded bytes

and so forth. Without knowing the context, there is no way of telling what those two bytes represent, or whether they need to be taken together as a pair, or as two distinct things.
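A few of these interpretations can be reproduced directly. This is a sketch assuming Python 3 and its standard codec names; only three of the text encodings are shown, and the Hangul syllable comes from the `utf-16-le` codec.

```python
raw = b'\xc3\xb1'

# The same two bytes read as 16-bit integers.
print(int.from_bytes(raw, 'big'))     # 50097
print(int.from_bytes(raw, 'little'))  # 45507

# The same two bytes read as text under different encodings.
as_text = {enc: raw.decode(enc) for enc in ('utf-8', 'latin-1', 'utf-16-le')}
print(as_text)
# {'utf-8': 'ñ', 'latin-1': 'Ã±', 'utf-16-le': '뇃'}
```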

With reference to the above para: What does he mean by "writing a raw UTF-8 encoded string"??

He means he is confused. You don't get a text string by encoding, you get bytes (I will accept "byte-string"). The adjective "raw" doesn't really mean anything in this context. You have bytes that were encoded, or you have a string containing characters. Raw doesn't really mean anything except "hey, pay attention, this is low-level stuff" (for some definition of "low level").

In Python2, one can do 'Jalape funny-n o'.

Nothing funny about it to Spanish speakers.

Personally, I have always considered "o" to be pretty funny. Say "woman" and "women" aloud -- in the first one, it sounds like "w-oo-man", in the second it sounds like "w-i-men". Now that's funny. But I digress.

If you type 'Jalapeño' in Python 2 (with or without the b prefix), the result you get will depend on your terminal settings, but the chances are high that the terminal will internally represent the string as UTF-8, which gives you bytes

b'Jalape\xc3\xb1o'

which is nine bytes. When printed, your terminal will try to print each byte separately, giving:

byte \x4a prints as J
byte \x61 prints as a
byte \x6c prints as l
...

and so forth. If you are unlucky your terminal may even be smart enough to print the two bytes \xc3\xb1 as one character, giving you the ñ you were hoping for. Why unlucky? Because you got the right result by accident. Next time you do the same thing, on a different terminal, or the same terminal set to a different encoding, you will get a completely different result, and think that Unicode is too messed up to use.

Using Python 2.5, here I print the same string three times in a row, changing the terminal's encoding each time:

py> print 'Jalape\xc3\xb1o'  # terminal set to UTF-8
Jalapeño
py> print 'Jalape\xc3\xb1o'  # and ISO-8859-6 (Arabic)
Jalapeأ�o
py> print 'Jalape\xc3\xb1o'  # and ISO-8859-5 (Cyrillic)
JalapeУБo

Which one is "right"? Answer: none of them. Not even the first, which by accident just happened to be what we were hoping for.
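The terminal's role can be imitated in Python 3 by decoding the same nine bytes with each encoding explicitly. This is only a sketch: `errors='replace'` is an assumption made here so that bytes with no mapping in an encoding show up as U+FFFD instead of raising an exception.

```python
data = b'Jalape\xc3\xb1o'

# Decode the same bytes the way three differently configured
# terminals would interpret them.
for encoding in ('utf-8', 'iso8859_6', 'iso8859_5'):
    # errors='replace' substitutes U+FFFD for bytes the encoding cannot map.
    print(encoding, data.decode(encoding, errors='replace'))
```

The bytes never change; only the interpretation applied to them does.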

Really, don't feel bad that you are confused. Between Python 2, and the terminal trying really hard to do the right thing, it is easy to get confused because sometimes the right thing happens and sometimes it doesn't.

This is a 'bytes' string where each glyph is 1 byte long

Nope. It's a string of characters. Glyphs don't come into it. Glyphs are the little pictures of letters that you see on the screen, or printed on paper. They could be bitmaps, or fancy vector graphics. They are unlikely to be one byte each -- more likely 200 bytes per glyph, based on a very rough calculation¹, but depending on whether it is a bitmap, a Postscript font, an OpenType font, or something else.

when stored internally so each glyph is associated with an integer as per charset ASCII or Latin-1. If these charsets have a funny-n glyph then yay! else nay! There is no UTF-8 here!! or UTF-16!! These are plain bytes (8 bits).

You're getting closer. But you are right: Python 2 "strings" are byte-strings, which means UTF-8 doesn't come into it. But your terminal might treat those bytes as UTF-8, and so accidentally do the "right" (wrong) thing.

Unicode is a really big mapping table between glyphs and integers and

Not glyphs. Between abstract "characters" and integers, called Code Points. Unicode contains:

distinct letters, digits, characters
accented letters
accents on their own
symbols, emoticons
ligatures and variant forms of characters
chars required only for backwards-compatibility with older encodings
whitespace
control characters
code points reserved for private use, which can mean anything you like
code points reserved as "will never be used"
code points explicitly labelled "not a character"

and possibly others I have forgotten.

are denoted as Uxxxx or Uxxxx-xxxx.

The official Unicode notation is:

U+xxxx
U+xxxxx
U+xxxxxx

that is U+ followed by exactly four, five or six hex digits. The U is always uppercase. Unfortunately Python doesn't support that notation, and you have to use either four or eight hex digits, e.g.:

\uFFFF
\U0010FFFF

For code points (ordinals) up to 255, you can also use hex or octal escapes, e.g. \xFF or \377.
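A short sketch, assuming Python 3, showing that the escape forms are interchangeable for code point 255 (octal 377), and that the eight-digit form reaches the top of the Unicode range:

```python
# Four escapes for the same code point, U+00FF.
forms = ['\xFF', '\377', '\u00FF', '\U000000FF']
print(all(f == chr(255) for f in forms))  # True

# Code points above U+FFFF need the eight-digit \U form.
print(hex(ord('\U0010FFFF')))  # 0x10ffff
```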

UTF-8 UTF-16 are encodings to store those big integers in an efficient manner.

Almost correct. They're not necessarily efficient.

Unicode code points are just abstract numbers that we give some meaning to. Code point 65 (U+0041, because hex 41 == decimal 65) means letter A, and so forth. Imagine these abstract code points floating in your head. How do you get the abstract concept of a code point into concrete form on a computer? The same way everything is put in a computer: as bytes, so we have to turn each abstract code point (a number) into a series of bytes.

Unicode code points range from U+0000 to U+10FFFF, which means we could just use exactly three bytes, which take values from 000000 to 10FFFF in hexadecimal. Values outside of this range, say 110000, would be an error. For reasons of efficiency, it's faster and better to use four bytes, even though one of the four will always have the value zero.

In a nutshell, that's the UTF-32 encoding: every character uses exactly four bytes. E.g. code point U+0041 (character A) is hex bytes 00000041, or possibly 41000000, depending on whether your computer is Big Endian or Little Endian.
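The two byte orders can be seen by asking for them explicitly. A minimal sketch, assuming Python 3.5 or later for `bytes.hex()`; the variable names are illustrative.

```python
big = 'A'.encode('utf-32-be')
little = 'A'.encode('utf-32-le')

print(big.hex())     # 00000041
print(little.hex())  # 41000000

# Code points stop at U+10FFFF; chr() rejects anything above that.
try:
    chr(0x110000)
except ValueError as exc:
    print('rejected:', exc)
```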

Since most text uses quite low ordinal values, that's awfully wasteful of memory. So UTF-16 uses just two bytes per character, and a weird scheme using so-called "surrogate pairs" for everything that won't fit into two bytes. It works, for some definition of "works", but is complicated, and you really want to avoid UTF-16 if you need code points above U+FFFF.
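To make the surrogate-pair scheme concrete, here is a sketch assuming Python 3. The emoji U+1F600 is an arbitrary example of a code point above U+FFFF, chosen here for illustration.

```python
bmp = '\u00F1'          # ñ fits in 16 bits
astral = '\U0001F600'   # a code point above U+FFFF

print(bmp.encode('utf-16-be').hex())     # 00f1
print(astral.encode('utf-16-be').hex())  # d83dde00

# The arithmetic behind the surrogate pair: subtract 0x10000, then
# split the remaining 20 bits into two 10-bit halves.
offset = ord(astral) - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)
print(hex(high), hex(low))  # 0xd83d 0xde00
```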

UTF-8 uses a neat variable encoding where characters with low ordinal values get encoded as a single byte (better still: it is the same byte as ASCII uses, which means old software that assumes everything in the world is ASCII will keep working, well mostly working). Higher ordinals get encoded as two, three or four bytes². Best of all, unlike most historical variable-width encodings, UTF-8 is self-synchronising. In legacy encodings, if a single byte gets corrupted, it can mangle everything from that point on. With UTF-8, a single corrupted byte will mangle only the single code-point containing it, everything following will be okay.
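Both properties can be checked with a small sketch, assuming Python 3. The sample characters and the corrupted byte position are arbitrary choices for illustration.

```python
# Bytes needed per code point in UTF-8.
samples = ['A', '\u00F1', '\u20AC', '\U0001F600']
lengths = [len(ch.encode('utf-8')) for ch in samples]
print(lengths)  # [1, 2, 3, 4]

# Corrupt one byte and decode leniently: the damage stays local.
good = 'a\u00F1b\u20ACc'.encode('utf-8')
damaged = bytearray(good)
damaged[1] = 0xFF  # overwrite the lead byte of the two-byte sequence
recovered = bytes(damaged).decode('utf-8', errors='replace')

# Everything after the damaged code point still decodes correctly.
print(recovered.endswith('b\u20ACc'))  # True
```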

So when DB says "writing a raw UTF-8 encoded string" - well the only way to do this is to use Python3 where the default string literals are stored in Unicode which then will use a UTF-8 UTF-16 internally to store the bytes in their respective structures; or, one could use u'Jalape' which is unicode in both languages (note the leading u).

Python never uses UTF-8 internally for storing strings in memory. Because it is a variable width encoding, you cannot index strings efficiently if they use UTF-8 for storage.

Instead, Python uses one of three different systems:

Before Python 3.3, you have a choice. When you compile the Python interpreter, you can choose whether it should use UTF-16 or UTF-32 for in-memory storage. This choice is called a "narrow" or "wide" build. A narrow build uses less memory, but cannot handle code points above U+FFFF very well. A wide build uses more memory, but handles the complete range of code points perfectly.
Starting in Python 3.3, the choice of how to store the string in memory is no longer decided up front when you build the Python interpreter. Instead, Python automatically chooses the most efficient internal representation for each individual string. Strings which only use ASCII or Latin-1 characters use one byte per character; strings which use code points up to U+FFFF use two bytes per character; and only strings which use code points above that use four bytes per character.
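The per-string choice can be observed roughly with `sys.getsizeof`. This sketch assumes CPython 3.3 or later; the exact numbers include object overhead and vary between versions, so only the ordering is meaningful.

```python
import sys

n = 1000
ascii_text = 'a' * n             # one byte per character
bmp_text = '\u20AC' * n          # two bytes per character
astral_text = '\U0001F600' * n   # four bytes per character

# Same length in characters, different sizes in memory.
for label, s in (('ascii', ascii_text),
                 ('bmp', bmp_text),
                 ('astral', astral_text)):
    print(label, len(s), sys.getsizeof(s))
```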

So assuming this is Python 3: 'Jalape \xYY \xZZ o' (spaces for readability) what DB is saying is that, the stupid-user would expect Jalapeno with a squiggly-n but instead he gets is: Jalape funny1 funny2 o (spaces for readability) -9 glyphs or 9 Unicode-points or 9-UTF8 characters. Correct?

Kind of. See above.

Which leaves me wondering what he means by: "This is because in UTF-8, the multibyte sequence \xc3\xb1 is supposed to represent the single character U+00F1, not the two characters U+00C3 and U+00B1"

He means that the single code point U+00F1 (character ñ, n with a tilde) is stored as the two bytes c3b1 (in hexadecimal) if you encode it using UTF-8. But if you stuff characters \xc3 \xb1 into a Unicode string (instead of bytes), then you get two Unicode characters U+00C3 and U+00B1.

To put it another way, inside strings, Python treats the hex escape \xC3 as just a different way of writing the Unicode code point \u00C3 or \U000000C3.

However, if you create a byte-string:

b'Jalape\xc3\xb1o'

by looking up a table of UTF-8 encodings, as presumably the original poster did, and then decode those bytes to a string, you will get what you expect. Using Python 2.5, where the b prefix is not needed:

py> tasty = 'Jalape\xc3\xb1o'  # actually bytes
py> tasty.decode('utf-8')
u'Jalape\xf1o'
py> print tasty.decode('utf-8')  # oops I forgot to reset my terminal
JalapeУБo
py> print tasty.decode('utf-8')  # terminal now set to UTF-8
Jalapeño

¹ Assume the font file is 50K in size, and it has glyphs for 256 characters. That works out to about 195 bytes per glyph.

² Technically, the UTF-8 scheme can handle 31-bit code points, up to the (hypothetical) code point U+7FFFFFFF, using up to six bytes per code point. But Unicode officially will never go past U+10FFFF, and so UTF-8 also will never go past four bytes per code point.

Answer 2:

No, the statement is correct.

In UTF-8 \xc3\xb1 is supposed to represent a single character. That is, if you decoded the string from UTF-8, you'd get a single character and therefore 8 characters.

However, in the particular example the string is treated as a raw sequence of characters and not UTF-8. Therefore, the two octets result in two characters.

I could be going a bit forward but see the following output of ipython:

In [1]: b'Jalape\xc3\xb1o'
Out[1]: b'Jalape\xc3\xb1o'

In [2]: len(b'Jalape\xc3\xb1o')
Out[2]: 9

In [3]: b'Jalape\xc3\xb1o'.decode('utf8')
Out[3]: 'Jalapeño'

In [4]: len(b'Jalape\xc3\xb1o'.decode('utf8'))
Out[4]: 8

In [5]: 'Jalape\xf1o'
Out[5]: 'Jalapeño'

The code above is for Python 3. For Python 2, byte strings (b'Jalape\xc3\xb1o') would be replaced with regular strings ('Jalape\xc3\xb1o'), and regular strings would be replaced with unicode strings (u'Jalape\xf1o').

Answer 3:

https://groups.google.com/forum/#!topic/comp.lang.python/1boxbYjhClg

Joshua Landau (answering my question) wrote:

"directly writing a raw UTF-8 encoded string such as 'Jalape\xc3\xb1o' simply produces a nine-character string U+004A, U+0061, U+006C, U+0061, U+0070, U+0065, U+00C3, U+00B1, U+006F, which is probably not what you intended. This is because in UTF-8, the multi-byte sequence \xc3\xb1 is supposed to represent the single character U+00F1, not the two characters U+00C3 and U+00B1."

Correct.

My original question was: Shouldn't this be 8 characters - not 9?

No, Python tends to be right on these things.

He says: \xc3\xb1 is supposed to represent the single character. However after some interaction with fellow Pythonistas I'm even more confused.

You would be, given the way he said it.

With reference to the above para: 1. What does he mean by "writing a raw UTF-8 encoded string"??

Well, that doesn't really mean much with no context like he gave it.

In Python2, one can do 'Jalape funny-n o'. This is a 'bytes' string where each glyph is 1 byte long when stored internally so each glyph is associated with an integer as per charset ASCII or Latin-1. If these charsets have a funny-n glyph then yay! else nay! There is no UTF-8 here!! or UTF-16!! These are plain bytes (8 bits).

Unicode is a really big mapping table between glyphs and integers and are denoted as Uxxxx or Uxxxx-xxxx.

Waits for our resident unicode experts to explain why you're actually wrong

UTF-8 UTF-16 are encodings to store those big integers in an efficient manner. So when DB says "writing a raw UTF-8 encoded string" - well the only way to do this is to use Python3 where the default string literals are stored in Unicode which then will use a UTF-8 UTF-16 internally to store the bytes in their respective structures; or, one could use u'Jalape' which is unicode in both languages (note the leading 'u').

Correct.

So assuming this is Python 3: 'Jalape \xYY \xZZ o' (spaces for readability) what DB is saying is that, the stupid-user would expect Jalapeno with a squiggly-n but instead he gets is: Jalape funny1 funny2 o (spaces for readability) -9 glyphs or 9 Unicode-points or 9-UTF8 characters. Correct?

I think so.

Which leaves me wondering what he means by: "This is because in UTF-8, the multi-byte sequence \xc3\xb1 is supposed to represent the single character U+00F1, not the two characters U+00C3 and U+00B1"

He's mixed some things up, AFAICT.

Could someone take the time to read carefully and clarify what DB is saying??

Here's a simple explanation: you're both wrong (or you're both almost right):

As of Python 3:

>>> "\xc3\xb1"
'Ã±'
>>> b"\xc3\xb1".decode()
'ñ'

"WHAT?!" you scream, "THAT'S WRONG!" But it's not. Let me explain.

Python 3's strings want you to give each character separately (*winces in case I'm wrong*). Python is interpreting the "\xc3" as "\N{LATIN CAPITAL LETTER A WITH TILDE}" and "\xb1" as "\N{PLUS-MINUS SIGN}"¹. This means that Python is given two characters. Python is basically doing this:

number = int("c3", 16) # Convert from base16
chr(number) # Turn to the character from the Unicode mapping

When you give Python raw bytes, you are saying that this is what the string looks like when encoded -- you are not giving Python Unicode, but encoded Unicode. This means that when you decode it (.decode()) it is free to convert multibyte sections to their relevant characters.

To see how an encoded string is not the same as the string itself, see:

>>> "Jalapeño".encode("ASCII", errors="xmlcharrefreplace")
b'Jalape&#241;o'

Those represent the same thing, but the first (according to Python) is the thing, the second needs to be decoded.

Now, bringing this back to the original:

>>> "\xc3\xb1".encode()
b'\xc3\x83\xc2\xb1'

You can see that the encoded bytes represent the two characters; the string you see above is not the encoded one. The encoding is internal to Python.
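As a hedged illustration of the difference, and not something the original poster showed: if the two characters were meant to be UTF-8 bytes, the intended text can be recovered in Python 3 by going through Latin-1, which maps code points 0 to 255 one-to-one onto byte values. The variable names below are illustrative.

```python
text = '\xc3\xb1'                # two characters: 'Ã±'
encoded = text.encode('utf-8')   # four bytes, two per character
print(encoded)                   # b'\xc3\x83\xc2\xb1'

# Latin-1 turns each code point below 256 into the byte of the same
# value, which recovers the bytes the escapes were meant to be.
intended = text.encode('latin-1')
print(intended)                  # b'\xc3\xb1'
print(intended.decode('utf-8'))  # ñ
```

This only works when every character in the string is below U+0100; it is a repair for this specific kind of mojibake, not a general technique.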

I hope that helps; good luck.

¹ Note that I find the "\N{...}" form much easier to read, and recommend it.