I'm observing that in the program
# -*- coding: utf-8 -*-
words = ['artists', 'Künstler', '艺术家', 'Митець']
for word in words:
    print word, type(word)
it is not absolutely necessary to fully qualify the strings as unicode strings:
words = ['artist', u'Künstler', u'艺术家', u'Митець']
The different alphabets are handled just fine without the 'u' prefix.
And so it appears that once coding: utf-8 is specified, all strings are encoded in Unicode. Is that true?
- Or is unicode used only if the string can no longer fit in range(128)?
- Why does type(word) report <str> in all cases? Isn't unicode a special datatype?
No. It means that byte sequences within the source code are interpreted as UTF-8. You have created byte strings, and the system is interpreting their contents naively (versus creating text with u'...').

Perhaps this will make it more clear:
Output:
In the first case you get byte strings encoded in the declared source encoding of UTF-8. They will only display correctly on a UTF-8 terminal.
In the second case you get Unicode strings. They will display correctly on any terminal whose encoding supports the characters.
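To make the difference between the two cases concrete, here is a minimal sketch. It is written for Python 3, not the Python 2 of the question, so it only emulates the behaviour: Python 3's bytes stands in for Python 2's str, and Python 3's str stands in for Python 2's unicode. The names rows, text and raw are mine.

```python
# -*- coding: utf-8 -*-
# Assumption: Python 3, where bytes ~ Python 2 str and str ~ Python 2 unicode.
words = ['artists', 'Künstler', '艺术家', 'Митець']

rows = []
for text in words:
    # Roughly what a plain '...' literal holds in Python 2 when the
    # source file is saved and declared as UTF-8: the raw encoded bytes.
    raw = text.encode('utf-8')
    rows.append((text, len(text), len(raw)))
    # ascii() escapes non-ASCII characters, so this prints on any console.
    print(ascii(text), len(text), raw, len(raw))
```

The character count and the byte count agree only for the pure ASCII word. For the others the byte string is longer than the text it encodes, which is why treating the bytes as if they were characters goes wrong as soon as the terminal is not UTF-8.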
Here's how the strings display on a Windows code page 437 console, using a Python environment variable to configure Python to replace unsupported characters instead of raising the default UnicodeEncodeError exception for them:

The byte strings are mostly garbage, but the Unicode strings are sensible: the Chinese and Cyrillic characters are replaced because that code page doesn't support them.
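The code page 437 behaviour can be approximated without a Windows console. The sketch below is again Python 3 and only simulates what the console does: it reinterprets the UTF-8 bytes as cp437 characters for the byte-string case, and encodes the text with the replace error handler for the Unicode case. The environment variable is presumably PYTHONIOENCODING (for example cp437:replace), but that is my assumption, since the answer doesn't name it. The names shown, as_bytes and as_text are mine.

```python
# -*- coding: utf-8 -*-
# Assumption: Python 3 simulation of a cp437 console, not a real console.
words = ['artists', 'Künstler', '艺术家', 'Митець']

shown = {}
for text in words:
    raw = text.encode('utf-8')
    # Byte-string case: the console maps each UTF-8 byte to a cp437 glyph.
    as_bytes = raw.decode('cp437')
    # Unicode case: characters are encoded for the console, and anything
    # cp437 lacks becomes '?' instead of raising UnicodeEncodeError.
    as_text = text.encode('cp437', errors='replace').decode('cp437')
    shown[text] = (as_bytes, as_text)
    # ascii() escapes non-ASCII characters, so this prints on any console.
    print(ascii(as_bytes), ascii(as_text))
```

In this simulation the ASCII word survives both paths, the Chinese and Cyrillic words come out as question marks on the Unicode path, and every non-ASCII word turns into unrelated glyphs on the byte path.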