What do I have to do in Python to figure out which encoding a string has?
Question:
Answer 1:
In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.
In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"
This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
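For Python 3, the same kind of type check might look like the sketch below. The function name describe and the returned labels are illustrative assumptions, not part of the original answer, and treating bytearray as "raw bytes" is a choice made here.

```python
def describe(value):
    """Return a label for the Python 3 type of value."""
    if isinstance(value, str):
        # Python 3 str always holds Unicode text
        return "text (str)"
    if isinstance(value, (bytes, bytearray)):
        # raw bytes; the encoding, if any, is not recorded in the object
        return "raw bytes"
    return "not a string"


print(describe("abc"))
print(describe(b"abc"))
print(describe(42))
```

As with the Python 2 version, this only reports the Python type; it says nothing about which encoding a bytes object uses.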
Answer 2:
How to tell if an object is a unicode string or a byte string
You can use type or isinstance.
In Python 2:
>>> type(u'abc') # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc') # Python 2 byte string literal
<type 'str'>
In Python 2, str is just a sequence of bytes. Python doesn't know what its encoding is. The unicode type is the safer way to store text.
If you want to understand this more, I recommend http://farmdev.com/talks/unicode/.
In Python 3:
>>> type('abc') # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc') # Python 3 byte string literal
<class 'bytes'>
In Python 3, str is like Python 2's unicode, and is used to store text. What was called str in Python 2 is called bytes in Python 3.
How to tell if a byte string is valid utf-8 or ascii
You can call decode. If it raises a UnicodeDecodeError exception, it wasn't valid.
>>> u_umlaut = b'\xc3\x9c' # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
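The decode check can be wrapped in a small helper, sketched here for Python 3. The name is_valid_encoding is a hypothetical choice for illustration.

```python
def is_valid_encoding(data, encoding):
    """Return True if the bytes in data decode cleanly with encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


u_umlaut = b"\xc3\x9c"  # UTF-8 representation of the letter 'Ü'
print(is_valid_encoding(u_umlaut, "utf-8"))
print(is_valid_encoding(u_umlaut, "ascii"))
```

A True result only means the bytes are valid in that encoding, not that they were actually written in it; the same bytes can be valid in several encodings.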
Answer 3:
In Python 3.x, all strings are sequences of Unicode characters, so an isinstance check for str (which means a Unicode string by default) should suffice.
isinstance(x, str)
With regard to Python 2.x, most people seem to use an if statement with two checks: one for str and one for unicode.
If you want to check if you have a 'string-like' object all with one statement though, you can do the following:
isinstance(x, basestring)
Answer 4:
Unicode is not an encoding - to quote Kumar McMillan:
If ASCII, UTF-8, and other byte strings are "text" ...
...then Unicode is "text-ness";
it is the abstract form of text
Have a read of McMillan's "Unicode In Python, Completely Demystified" talk from PyCon 2008. It explains things a lot better than most of the related answers on Stack Overflow.
Answer 5:
If your code needs to be compatible with both Python 2 and Python 3, you can't directly use things like isinstance(s, bytes) or isinstance(s, unicode) without wrapping them in either try/except or a Python version test, because unicode is undefined in Python 3, and bytes is undefined in Python 2 before 2.6 (from 2.6 on it is only an alias for str).
There are some ugly workarounds. An extremely ugly one is to compare the name of the type, instead of comparing the type itself. Here's an example:
# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)
An arguably slightly less ugly workaround is to check the Python version number, e.g.:
import sys

if sys.version_info >= (3, 0, 0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)
Those are both unpythonic, and most of the time there's probably a better way.
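One possible alternative, offered only as a sketch, is to resolve the text type once with try/except NameError and then use plain isinstance checks. The names text_type and to_native_str are hypothetical, and the sketch assumes Python 2.6 or later, where bytes exists as an alias for str.

```python
try:
    text_type = unicode  # Python 2: the separate Unicode text type
except NameError:
    text_type = str      # Python 3: str is already the text type


def to_native_str(s, encoding="ascii"):
    """Return s as the native str type of the running interpreter."""
    if isinstance(s, str):
        return s
    if isinstance(s, text_type):
        # only reachable on Python 2: unicode -> str
        return s.encode(encoding)
    if isinstance(s, bytes):
        # only reachable on Python 3: bytes -> str
        return s.decode(encoding)
    raise TypeError("expected a string, got %s" % type(s).__name__)


print(to_native_str(b"abc"))
print(to_native_str("abc"))
```

The version-dependent lookup happens once at import time, so the function body itself contains no version test.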
Answer 6:
Use:
import six
if isinstance(obj, six.text_type):
    pass  # obj is a text (Unicode) string
Inside the six library, the related string_types is defined as shown below (text_type itself is str on Python 3 and unicode on Python 2):
if PY3:
    string_types = str,
else:
    string_types = basestring,
Answer 7:
Note that on Python 3, it's not really fair to say any of:
strs are UTFx for any x (e.g. UTF8)
strs are Unicode
strs are ordered collections of Unicode characters
Python's str type is (normally) a sequence of Unicode code points, some of which map to characters.
Even on Python 3, it's not as simple to answer this question as you might imagine.
An obvious way to test for ASCII-compatible strings is by an attempted encode:
"Hello there!".encode("ascii")
#>>> b'Hello there!'
"Hello there... ☃!".encode("ascii")
#>>> Traceback (most recent call last):
#>>> File "", line 4, in <module>
#>>> UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 15: ordinal not in range(128)
The error distinguishes the cases.
In Python 3, there are even some strings that contain invalid Unicode code points:
"Hello there!".encode("utf8")
#>>> b'Hello there!'
"\udcc3".encode("utf8")
#>>> Traceback (most recent call last):
#>>> File "", line 19, in <module>
#>>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed
The same approach, an attempted encode, distinguishes these cases.
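Both checks above follow the same pattern, which could be sketched as one helper for Python 3. The name can_encode is a hypothetical choice for illustration.

```python
def can_encode(text, encoding):
    """Return True if the str in text can be encoded with encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(can_encode("Hello there!", "ascii"))
print(can_encode("Hello there... \u2603!", "ascii"))  # contains a snowman
print(can_encode("\udcc3", "utf-8"))                  # lone surrogate
```

On Python 3.7 and later, str.isascii() answers the ASCII question directly without an encode attempt.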
Answer 8:
This may help someone else. I started out testing for the string type of the variable s, but for my application it made more sense to simply return s as UTF-8. The process calling return_utf then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend it to be Python-version agnostic without a version test or importing six. Please comment with improvements to the sample code below to help other people.
def return_utf(s):
    if isinstance(s, str):
        return s.encode('utf-8')
    if isinstance(s, (int, float, complex)):
        return str(s).encode('utf-8')
    try:
        return s.encode('utf-8')
    except TypeError:
        try:
            return str(s).encode('utf-8')
        except AttributeError:
            return s
    except AttributeError:
        return s
    return s  # assume it was already utf-8
Answer 9:
You could use the Universal Encoding Detector (chardet), but be aware that it only gives a best guess, not the actual encoding, because it is impossible to know the encoding of a string such as "abc". You will need to get the encoding information from elsewhere; for example, HTTP uses the Content-Type header for that.
Answer 10:
For py2/py3 compatibility, simply use
import six
if isinstance(obj, six.text_type):
    pass  # obj is a text (Unicode) string
Answer 11:
One simple approach is to check whether unicode is a builtin. If it is, you're on Python 2 and a plain string literal is a byte string (str). To ensure everything is unicode, one can do:
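try:
    import builtins  # Python 3
except ImportError:
    import __builtin__ as builtins  # Python 2 name of the same module
i = 'cats'
if 'unicode' in dir(builtins):  # True in Python 2, False in Python 3
    i = unicode(i)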