I want to check whether a string is ASCII or not.
I am aware of ord(); however, when I try ord('é'), I get TypeError: ord() expected a character, but string of length 2 found. I understand it is caused by the way I built Python (as explained in ord()'s documentation).
Is there another way to check?
Vincent Marchetti has the right idea, but str.decode has been removed in Python 3. In Python 3 you can make the same test with str.encode. Note that the exception you want to catch has also changed, from UnicodeDecodeError to UnicodeEncodeError.
Your question is incorrect; the error you see is not a result of how you built Python, but of a confusion between byte strings and Unicode strings.
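To make the Python 3 str.encode test mentioned above concrete, here is a minimal sketch. It assumes Python 3, where str is a Unicode string; the helper name is_ascii is hypothetical and not taken from any of the answers.

```python
def is_ascii(text):
    """Return True if every character in text is in the ASCII range."""
    try:
        # Encoding fails as soon as a non-ASCII character is encountered.
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii("hello"))   # True
print(is_ascii("\u00e9"))  # False
```

With this approach an empty string counts as ASCII. On Python 3.7 and later, the built-in str.isascii() performs the same check.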
Byte strings (e.g. "foo", or 'bar', in Python 2 syntax) are sequences of octets: numbers from 0 to 255. Unicode strings (e.g. u"foo" or u'bar') are sequences of Unicode code points: numbers from 0 to 1114111 (0x10FFFF). But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.
Instead of ord(u'é'), try calling ord() on each code point in the string. That tells you which sequence of code points "é" represents. It may give you [233], or it may give you [101, 769].
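As a rough sketch of listing the code points of a string, the snippet below spells out both forms of "é" with escape sequences so the result does not depend on the terminal or source encoding. It assumes Python 3; the helper name code_points is hypothetical.

```python
composed = "\u00e9"     # single code point: e with acute accent
decomposed = "e\u0301"  # "e" followed by a combining acute accent


def code_points(text):
    """Return the code point of each element of the string."""
    return [ord(ch) for ch in text]


print(code_points(composed))    # [233]
print(code_points(decomposed))  # [101, 769]
```

Both strings render as the same character, but they are different sequences of code points.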
To reverse this, use unichr() instead of chr().
This character may actually be represented as either a single or multiple Unicode "code points", which themselves represent either graphemes or characters. It's either "e with an acute accent" (code point 233), or "e" (code point 101) followed by "an acute accent on the previous character" (code point 769). So this exact same character may be presented as the Python data structure u'e\u0301' or u'\u00e9'.
Most of the time you shouldn't have to care about this, but it can become an issue if you are iterating over a Unicode string, as iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.
The Unicode Glossary can be a helpful guide to understanding some of these issues, by pointing out how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.
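The length difference and the conversion between composed and decomposed forms can be sketched as follows. This assumes Python 3, where chr() covers the full Unicode range and plays the role unichr() has in Python 2; the variable names are illustrative only.

```python
import unicodedata

composed = "\u00e9"     # one code point
decomposed = "e\u0301"  # two code points

# Going from a code point back to a character.
print(chr(233) == composed)  # True

# Same visible character, different lengths, and not equal as strings.
print(len(composed), len(decomposed))  # 1 2
print(composed == decomposed)          # False

# NFC composes where possible; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)    # True
print(nfd == decomposed)  # True
```

Normalizing both sides to the same form before comparing or measuring strings avoids surprises from mixed representations.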
To include an empty string as ASCII, change the + to *.
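The pattern this remark refers to is not shown here. Purely as an assumption, suppose it is a character-class regular expression along the lines of [\x00-\x7F]+; the sketch below then shows what swapping the quantifier changes. The names ASCII_NONEMPTY, ASCII_OR_EMPTY and matches are hypothetical.

```python
import re

# Assumed pattern: one or more characters in the ASCII range.
ASCII_NONEMPTY = re.compile(r"[\x00-\x7F]+")
# Same pattern with * instead of +, so the empty string also matches.
ASCII_OR_EMPTY = re.compile(r"[\x00-\x7F]*")


def matches(pattern, text):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


print(matches(ASCII_NONEMPTY, ""))     # False
print(matches(ASCII_OR_EMPTY, ""))     # True
print(matches(ASCII_OR_EMPTY, "abc"))  # True
```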