What do I have to do in Python to figure out which encoding a string has?
In Python 3.x, all strings are sequences of Unicode characters, so an isinstance check against str (which is a Unicode string by default) should suffice.
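A minimal sketch of that check, assuming Python 3 (the helper name is_text is hypothetical, chosen only for illustration):

```python
def is_text(obj):
    """Return True if obj is a Python 3 text (Unicode) string."""
    return isinstance(obj, str)


print(is_text(u"h\u00e9llo"))  # a str holds Unicode code points
print(is_text(b"hello"))       # bytes are raw data, not text
```

Note that this tells you the object is text; it says nothing about an encoding, because a str has none until it is encoded to bytes.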
In Python 2.x, most people seem to use an if statement with two checks: one for str and one for unicode.
If you want to check whether you have a 'string-like' object in a single statement, though, you can do the following:
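The code for this answer did not survive on this page. One way to do it, as a sketch: on Python 2, basestring is the common base class of str and unicode, so a single isinstance check covers both. The fallback to str for Python 3 and the names string_types and is_string_like are my own additions, not part of the original answer.

```python
try:
    string_types = basestring  # Python 2: common base class of str and unicode
except NameError:
    string_types = str  # Python 3: basestring is gone (assumed fallback)


def is_string_like(obj):
    """Return True for str/unicode on Python 2, or str on Python 3."""
    return isinstance(obj, string_types)


print(is_string_like("abc"))
print(is_string_like(123))
```

On Python 3 this returns False for bytes objects, which is usually what you want when you are asking "is this text?".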
Unicode is not an encoding - to quote Kumar McMillan:
Have a read of McMillan's Unicode In Python, Completely Demystified talk from PyCon 2008; it explains things a lot better than most of the related answers on Stack Overflow.
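To make the distinction concrete, here is a small sketch (not taken from the talk) showing that the same Unicode text turns into different bytes depending on which encoding you pick. The variable names are illustrative only.

```python
text = u"\u00e9"  # one Unicode code point: LATIN SMALL LETTER E WITH ACUTE

utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA9
latin1_bytes = text.encode("latin-1")  # one byte: 0xE9

# Different encodings produce different bytes for the same text...
print(utf8_bytes != latin1_bytes)

# ...and each decodes back to the same text when the right encoding is named.
print(utf8_bytes.decode("utf-8") == text)
print(latin1_bytes.decode("latin-1") == text)
```

The encoding is a property of the bytes, not of the text, which is why asking "what encoding is this Unicode string?" has no answer.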
If your code needs to be compatible with both Python 2 and Python 3, you can't directly use things like isinstance(s, bytes) or isinstance(s, unicode) without wrapping them in either try/except or a Python version test, because bytes does not mean the same thing in Python 2 (it is undefined before 2.6, and in 2.6 and 2.7 it is just an alias for str) and unicode is undefined in Python 3.
There are some ugly workarounds. An extremely ugly one is to compare the name of the type, instead of comparing the type itself. Here's an example:
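The example from this answer is missing on this page, so the following is a reconstruction of the idea rather than the author's exact code. The function name to_native_str and the default "ascii" encoding are assumptions.

```python
def to_native_str(s, encoding="ascii"):
    """Convert Python 3 bytes or Python 2 unicode to the native str type."""
    type_name = type(s).__name__
    if type_name == "bytes":
        # Reached on Python 3 only. On Python 2.6+, bytes is an alias of
        # str, whose type name is "str", so this branch is skipped there.
        return s.decode(encoding)
    if type_name == "unicode":
        # Only possible on Python 2.
        return s.encode(encoding)
    return s


print(to_native_str(b"abc") == "abc")
print(to_native_str("abc") == "abc")
```

Part of what makes this ugly is that comparing names ignores inheritance: a subclass of bytes or unicode has a different name and slips through unconverted.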
An arguably slightly less ugly workaround is to check the Python version number, e.g.:
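Again the original snippet is not on this page; this is a sketch of what a version test could look like. The names PY3 and to_native_str and the default "ascii" encoding are assumptions.

```python
import sys

PY3 = sys.version_info[0] >= 3


def to_native_str(s, encoding="ascii"):
    """Convert Python 3 bytes or Python 2 unicode to the native str type."""
    if PY3:
        if isinstance(s, bytes):
            return s.decode(encoding)
    else:
        # This branch only runs on Python 2, where the name unicode exists.
        if isinstance(s, unicode):
            return s.encode(encoding)
    return s


print(to_native_str(b"abc") == "abc")
```

The reference to unicode is never evaluated on Python 3, so it does not raise a NameError there, although linters may still flag it.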
Those are both unpythonic, and most of the time there's probably a better way.
This may help someone else. I started out testing for the string type of the variable s, but for my application it made more sense to simply return s as UTF-8. The process calling return_utf then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend it to be Python version agnostic without a version test or importing six. Please comment with improvements to the sample code below to help other people.
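The sample code itself is missing from this page. What follows is a simplified sketch of a function in the same spirit, keeping the name return_utf from the answer; the body is my assumption about the behaviour described, not the author's code.

```python
def return_utf(s):
    """Return s as UTF-8 encoded bytes where possible."""
    if isinstance(s, (int, float, complex)):
        # Numbers are converted to their text form first.
        s = str(s)
    try:
        return s.encode("utf-8")
    except AttributeError:
        # No .encode method (for example Python 3 bytes): return unchanged,
        # assuming it is already UTF-8 encoded.
        return s
    except UnicodeDecodeError:
        # Python 2 byte string containing non-ASCII bytes: return unchanged,
        # assuming it is already UTF-8 encoded.
        return s


print(return_utf(u"\u00e9") == b"\xc3\xa9")
print(return_utf(42) == b"42")
```

The two "assume it is already UTF-8" branches are the weak point: nothing verifies that assumption, so bytes in another encoding pass through untouched.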
You could use Universal Encoding Detector, but be aware that it will only give you a best guess, not the actual encoding, because it's impossible to know the encoding of a string like "abc", for example. You will need to get the encoding information from elsewhere; HTTP, for example, uses the Content-Type header for that.
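A standard-library sketch of both points: the bytes "abc" decode identically under several encodings, so no detector can tell them apart, and the charset declared in a Content-Type header is the reliable source. The helper name charset_from_content_type and the sample header value are hypothetical.

```python
from email.message import Message

# The same bytes are valid, and identical, in many encodings.
raw = b"abc"
candidates = ["ascii", "utf-8", "latin-1", "cp1252"]
decoded = set(raw.decode(enc) for enc in candidates)
print(len(decoded) == 1)  # every candidate gives the same text

# Other bytes decode without error under several encodings, but to
# different text, so a detector can only guess from statistics.
ambiguous = b"\xe9"
print(ambiguous.decode("latin-1") != ambiguous.decode("cp1251"))


def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # None if no charset is declared


charset = charset_from_content_type("text/html; charset=ISO-8859-1")
text = ambiguous.decode(charset)
```

When the header carries no charset, the function returns None and you are back to defaults or guessing.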