I want to check whether a string is in ASCII or not.
I am aware of ord(); however, when I try ord('é'), I get TypeError: ord() expected a character, but string of length 2 found. I understand this is caused by the way I built Python (as explained in ord()'s documentation).
Is there another way to check?
Like @RogerDahl's answer, but it's more efficient to short-circuit by negating the character class and using search instead of findall or match.
I imagine a regular expression is well-optimized for this.
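A minimal sketch of the short-circuit idea, written for Python 3 as an assumption (the thread itself targets Python 2). The names is_ascii and _NON_ASCII are illustrative and do not come from the answer. The negated class matches the first character outside the 7-bit range, so search can stop as soon as it finds one.

```python
import re

# Matches any single character outside the 7-bit ASCII range.
_NON_ASCII = re.compile(r'[^\x00-\x7f]')


def is_ascii(text):
    """Return True if text contains only ASCII characters."""
    # search() returns at the first offending character, if any.
    return _NON_ASCII.search(text) is None


print(is_ascii('hello'))
print(is_ascii('caf\u00e9'))
```

An empty string counts as ASCII here, since it contains no offending character.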
You could use the regular expression library, which accepts the POSIX standard [[:ASCII:]] definition.
Ran into something like this recently - for future reference
which you could use with:
A string (str type) in Python is a series of bytes. There is no way of telling just from looking at the string whether this series of bytes represents an ASCII string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.
However, if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check whether it contains characters outside the range you are concerned about.
I found this question while trying to determine how to use/encode/decode a string whose encoding I wasn't sure of (and how to escape/convert special characters in that string).
My first step should have been to check the type of the string. I didn't realize that I could get good data about its formatting from type(s). This answer was very helpful and got to the real root of my issues.
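As a rough illustration of checking the type first, here is a sketch in Python 3 terms. This is an assumption: in Python 2, which this answer describes, the two types are str (bytes) and unicode (text), whereas in Python 3 they are bytes and str. The helper name describe is hypothetical.

```python
def describe(value):
    """Report whether value is raw bytes or decoded text."""
    if isinstance(value, bytes):
        # Raw bytes: an encoding must be known before decoding.
        return 'bytes'
    if isinstance(value, str):
        # Already text: no decoding step is needed.
        return 'text'
    return type(value).__name__


print(describe(b'abc'))
print(describe('abc'))
```

Knowing which of the two you hold tells you whether the next step is decode (bytes to text) or encode (text to bytes).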
If you're getting a rude and persistent
particularly when you're ENCODING, make sure you're not trying to unicode() a string that already IS unicode. For some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe, and the Python docs tutorials for better understanding of how terrible this can be.)
Eventually I determined that what I wanted to do was this:
Also helpful in debugging was setting the default encoding of my file to utf-8 (put this at the beginning of your Python file):
That allows you to test special characters ('àéç') without having to use their unicode escapes (u'\xe0\xe9\xe7').
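Assuming the line meant here is the standard source-encoding declaration (PEP 263), a sketch of it looks like the following. The variable names are illustrative. The example is written for Python 3, where UTF-8 is already the default source encoding, so there the declaration only makes the choice explicit; under Python 2 it is what permits the non-ASCII literal at all.

```python
# -*- coding: utf-8 -*-
# The declaration above must sit on the first or second line of the file.

literal = 'àéç'            # typed directly, thanks to the UTF-8 source encoding
escaped = '\xe0\xe9\xe7'   # the same three characters written as escapes

print(literal == escaped)
```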
I think you are not asking the right question.
A string in Python has no property corresponding to 'ascii', utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that's where you need to go for an answer.
Perhaps the question you can ask is: "Is this string the result of encoding a unicode string in ascii?" This you can answer by trying:
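The answer's own snippet is not preserved here. One way to try it, sketched in Python 3 under the assumption that the value in question is a byte string, is to attempt an ASCII decode and see whether it fails. The function name may_be_ascii_encoded is hypothetical.

```python
def may_be_ascii_encoded(data):
    """Return True if the byte string decodes cleanly as ASCII."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        # At least one byte is outside the 0-127 range.
        return False
    # Success only shows the bytes are compatible with ASCII,
    # not that ASCII was the encoding actually used.
    return True


print(may_be_ascii_encoded(b'hello'))
print(may_be_ascii_encoded('caf\u00e9'.encode('utf-8')))
```

A True result is a "may have been", not a proof: the same bytes are also valid UTF-8 and ISO-8859-1.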