i know that django uses unicode strings all over the framework instead of normal python strings. what encoding are normal python strings use ? and why don't they use unicode?
相关问题
- how to define constructor for Python's new Nam
- streaming md5sum of contents of a large remote tar
- How to get the background from multiple images by
- Evil ctypes hack in python
- Correctly parse PDF paragraphs with Python
Before Python 3.0, string encoding was
ascii
by default, but could be changed. Unicode string literals wereu"..."
. This was silly.Python 2.x strings are 8-bit, nothing more. The encoding may vary (though ASCII is assumed). I guess the reasons are historical. Few languages, especially languages that date back to the last century, use unicode right away.
In Python 3, all strings are unicode.
In Python 3.x
str
is Unicode. This may be either UTF-16 or UTF-32 depending on whether your Python interpreter was built with "narrow" or "wide" Unicode characters.The Windows version of CPython uses UTF-16. On Unix-like systems, UTF-32 tends to be preferred.
In Python 2.x
str
is a byte string type like Cchar
. The encoding isn't defined by the language, but is whatever your locale's default encoding is. Or whatever the MIME charset of the document you got off the Internet is. Or, if you get a string from a function likestruct.pack
, it's binary data, and doesn't meaningfully have a character encoding at all.unicode
strings in 2.x are equivalent tostr
in 3.x.Because Python (slightly) predates Unicode. And because Guido wanted to save all the major backwards-incompatible changes for 3.0. Strings in 3.x do use Unicode by default.
In Python 2: Normal strings (Python 2.x
str
) don't have an encoding: they are raw data.In Python 3: These are called "bytes" which is an accurate description, as they are simply sequences of bytes, which can be text encoded in any encoding (several are common!) or non-textual data altogether.
For representing text, you want unicode strings, not byte strings. By "unicode strings", I mean
unicode
instances in Python 2 andstr
instances in Python 3. Unicode strings are sequences of unicode codepoints represented abstractly without an encoding; this is well-suited for representing text.Bytestrings are important because to represent data for transmission over a network or writing to a file or whatever, you cannot have an abstract representation of unicode, you need a concrete representation of bytes. Though they are often used to store and represent text, this is at least a little naughty.
This whole situation is complicated by the fact that while you should turn unicode into bytes by calling
encode
and turn bytes into unicode usingdecode
, Python will try to do this automagically for you using a global encoding you can set that is by default ASCII, which is the safest choice. Never depend on this for your code and never ever change this to a more flexible encoding--explicitly decode when you get a bytestring and encode if you need to send a string somewhere external.Hey! I'd like to add some stuff to other answers, unfortunately I don't have enough rep yet to do that properly :-(
FWIW, Mike Graham's post is pretty good and that's probably what you should be reading first.
Here's a few comments:
from __future__ import unicode_literals
# -*- coding: utf-8 -*-
. For more information see PEP 0263. Changing the source encoding affects how Unicode literals (regardless of their prefix or lack of prefix, as affected by point 1) are interpreted. In Py3k, the default file encoding is UTF-8.str
in py3k,unicode
in 2.x) because at some point in time stuff's going to have to be written to memory. Ideally, this would never be evident to the end-user. Unfortunately nothing's perfect and you can occasionally run into problems with this: specifically if you use funky squiggles outside of the Unicode Base Multilingual Plane. Since Python 2.2, we've had what's called wide builds and narrow builds; these names refer to the type used internally to store Unicode code points. Wide builds use UCS-4, which uses 4 bytes to store a Unicode code point. (This means UCS-4's code unit size is 4 bytes, or 32 bits.) Narrow builds use UCS-2. UCS-2 only has 16 bits, and therefore can not encode all Unicode code points accurately (it's like UTF-16, except without the surrogate pairs). To check, test the value ofsys.maxunicode
. If it's1114111
, you've got a wide build (which can correctly represent all of Unicode). If it's less, well, don't fret too much. The BMP (code points0x0000
to0xFFFF
) covers most people's needs. For more information, see PEP 0261.From Python 3.0 on all strings are unicode by default, there is also the bytes datatype (Python documentation).
So the python developers think that using unicode is a good idea, that it is not used universally in python 2 is mostly due to backwards compatibility. It also has performance implications.