What do I have to do in Python to figure out which encoding a string has?
Note that on Python 3, it's not really fair to say any of:
- strs are UTFx for any x (e.g. UTF8)
- strs are Unicode
- strs are ordered collections of Unicode characters
Python's str type is (normally) a sequence of Unicode code points, some of which map to characters.
Even on Python 3, it's not as simple to answer this question as you might imagine.
An obvious way to test for ASCII-compatible strings is to attempt an encode to ASCII. Whether or not a UnicodeEncodeError is raised distinguishes the cases.
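A minimal sketch of the attempted-encode test, assuming Python 3. The helper name is_ascii is hypothetical and only used for illustration:

```python
def is_ascii(text):
    """Return True if every character of the str fits in ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        # At least one code point is outside the ASCII range.
        return False
    return True


print(is_ascii("hello"))     # True
print(is_ascii("h\xe9llo"))  # False, U+00E9 is not ASCII
```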
In Python 3, there are even some strings that contain invalid Unicode code points, such as lone surrogates. The same method distinguishes them: attempting to encode such a string to UTF-8 raises an error.
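As a rough illustration on Python 3, a str holding a lone surrogate can be detected by attempting a UTF-8 encode. The helper name is_valid_unicode_text is an assumption made for this sketch:

```python
def is_valid_unicode_text(text):
    """Return True if the str can be encoded to UTF-8 without error."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        # Lone surrogates such as U+DC80 cannot be encoded to UTF-8.
        return False
    return True


print(is_valid_unicode_text("caf\xe9"))    # True
print(is_valid_unicode_text("bad\udc80"))  # False
```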
For py2/py3 compatibility, simply use the six library and test isinstance(obj, six.text_type). Inside six, text_type is defined as str on Python 3 and as unicode on Python 2.
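The idea behind this compatibility check can be sketched without installing six. The local name text_type below is a stand-in chosen for this example, not six itself; with six available you would use six.text_type directly:

```python
import sys

PY3 = sys.version_info[0] >= 3

# Pick the text type for the running interpreter.
if PY3:
    text_type = str
else:
    text_type = unicode  # noqa: F821, this name exists only on Python 2


def is_text(obj):
    """Return True if obj is a text (Unicode) string on this interpreter."""
    return isinstance(obj, text_type)


print(is_text(u"abc"))
print(is_text(b"abc"))
```

On Python 3 the first call reports text and the second does not. On Python 2 a plain b"abc" literal is a str, which is likewise not text_type.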
How to tell if an object is a unicode string or a byte string
You can use type or isinstance. The types to check against differ between Python 2 and Python 3.
In Python 2, str is just a sequence of bytes. Python doesn't know what its encoding is. The unicode type is the safer way to store text. If you want to understand this more, I recommend http://farmdev.com/talks/unicode/.
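For comparison, on Python 3 the isinstance check can be sketched as follows. The function name describe and its return labels are made up for this example:

```python
def describe(obj):
    """Label an object as text, raw bytes, or neither (Python 3)."""
    if isinstance(obj, str):
        return "text"
    if isinstance(obj, bytes):
        return "bytes"
    return "neither"


print(describe("abc"))   # text
print(describe(b"abc"))  # bytes
print(describe(123))     # neither
```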
In Python 3, str is like Python 2's unicode, and is used to store text. What was called str in Python 2 is called bytes in Python 3.
How to tell if a byte string is valid utf-8 or ascii
You can call decode. If it raises a UnicodeDecodeError exception, it wasn't valid.
One simple approach is to check whether unicode is a builtin function. If so, you're in Python 2 and a plain string will be a str (a byte string). To ensure everything is in unicode, one can then convert the value with unicode().
In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.
In Python 2, a string may be of type str or of type unicode. You can tell which with type or isinstance.
This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
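Because a bytestring may hold any of these, the decode test mentioned earlier is one way to narrow things down. The following is a sketch for Python 3 with a hypothetical helper name, classify_bytes. A successful decode only shows the bytes are valid in that encoding, not that the author intended it:

```python
def classify_bytes(data):
    """Return 'ascii', 'utf-8', or 'unknown' for a bytes object."""
    for encoding in ("ascii", "utf-8"):
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # Not valid in this encoding, try the next one.
        return encoding
    return "unknown"


print(classify_bytes(b"hello"))          # ascii
print(classify_bytes(b"caf\xc3\xa9"))    # utf-8
print(classify_bytes(b"\xff\xfe\x00"))   # unknown
```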