How to check if a string in Python is in ASCII?

Posted 2019-01-02 22:20

I want to check whether a string is in ASCII or not.

I am aware of ord(); however, when I try ord('é'), I get TypeError: ord() expected a character, but string of length 2 found. I understand this is caused by the way my Python was built (as explained in ord()'s documentation).

Is there another way to check?

16 answers
在下西门庆
#2 · 2019-01-02 22:42

Like @RogerDahl's answer, but it is more efficient to short-circuit by negating the character class and using search instead of findall or match.

>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True

I imagine a regular expression is well-optimized for this.
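
As a rough sketch of how the same search could be wrapped for reuse, assuming Python 3 (where a str holds Unicode code points): the helper name contains_only_ascii and the module-level pattern are illustrative choices, not part of the original answer. Note that the helper returns the opposite of the expressions above, which are True when a non-ASCII character is found.

```python
import re

# Matches any single character outside the 7-bit ASCII range.
# Compiled once so repeated checks reuse the same pattern object.
NON_ASCII = re.compile(r'[^\x00-\x7F]')


def contains_only_ascii(text):
    """Return True if no character in text falls outside 0x00-0x7F."""
    # search() stops at the first non-ASCII character it finds.
    return NON_ASCII.search(text) is None


print(contains_only_ascii('Did you catch that \x00?'))  # True
print(contains_only_ascii('Did you catch that \xFF?'))  # False
```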

Bombasti
#3 · 2019-01-02 22:43

You could use the regular expression library which accepts the POSIX standard [[:ASCII:]] definition.

姐就是有狂的资本
#4 · 2019-01-02 22:44

Ran into something like this recently - for future reference

import chardet

# chardet.detect() returns a dict; its 'encoding' key holds the best guess
encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
    print('string is in ascii')

which you could use with:

string_ascii = string.decode(encoding['encoding']).encode('ascii')
手持菜刀,她持情操
#5 · 2019-01-02 22:46

A string (str type) in Python 2 is a series of bytes. There is no way of telling just from looking at the string whether this series of bytes represents an ASCII string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.

However if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.
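
A minimal sketch of the decode-then-check approach, assuming Python 3, where the byte series is a bytes object and the decoded result is a str. The function name decoded_is_ascii and the sample byte strings are made up for illustration.

```python
def decoded_is_ascii(raw, encoding):
    """Decode raw bytes with a known encoding, then check each character."""
    text = raw.decode(encoding)
    # A character is ASCII when its code point is below 128.
    return all(ord(ch) < 128 for ch in text)


print(decoded_is_ascii(b'plain text', 'utf-8'))     # True
print(decoded_is_ascii(b'caf\xc3\xa9', 'utf-8'))    # False
print(decoded_is_ascii(b'caf\xe9', 'iso-8859-1'))   # False
```

The last two calls decode different byte sequences to the same text, 'café', which is why the encoding has to be known up front.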

该账号已被封号
#6 · 2019-01-02 22:47

I found this question while trying to determine how to use/encode/decode a string whose encoding I wasn't sure of (and how to escape/convert special characters in that string).

My first step should have been to check the type of the string. I didn't realize I could get good data about its formatting from type(s). This answer was very helpful and got to the real root of my issues.

If you're getting a rude and persistent

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 263: ordinal not in range(128)

particularly when you're ENCODING, make sure you're not trying to unicode() a string that already IS unicode. For some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe and the Python docs tutorials for a better understanding of how terrible this can be.)

Eventually I determined that what I wanted to do was this:

escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))

Also helpful in debugging was declaring the source encoding of my file as utf-8 (put this at the beginning of your Python file):

# -*- coding: utf-8 -*-

That allows you to test special characters ('àéç') without having to use their unicode escapes (u'\xe0\xe9\xe7').

>>> specials='àéç'
>>> specials.decode('latin-1').encode('ascii','xmlcharrefreplace')
'&#224;&#233;&#231;'
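
For comparison, here is a small sketch of the same escaping under Python 3, which is an assumption on my part since the code above is Python 2. In Python 3 a str is already Unicode, so there is no decode step, and the result of encode() is a bytes object.

```python
# The same three characters as above, written with escapes: a-grave,
# e-acute, c-cedilla.
specials = '\xe0\xe9\xe7'

# Characters that ASCII cannot represent are replaced with numeric
# XML character references instead of raising UnicodeEncodeError.
escaped = specials.encode('ascii', 'xmlcharrefreplace')

print(escaped)                  # b'&#224;&#233;&#231;'
print(escaped.decode('ascii'))  # &#224;&#233;&#231;
```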
够拽才男人
#7 · 2019-01-02 22:48

I think you are not asking the right question--

A string in Python has no property corresponding to 'ascii', utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that's where you need to go for an answer.

Perhaps the question you can ask is: "Is this string the result of encoding a unicode string in ascii?" -- This you can answer by trying:

try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print("it was not an ASCII-encoded unicode string")
else:
    print("it may have been an ASCII-encoded unicode string")
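
The same try/except idea might look like this in Python 3, which is an assumption since the snippet above is Python 2. The helper name fits_in_ascii is hypothetical; it accepts either bytes (which are decoded) or str (which is encoded).

```python
def fits_in_ascii(value):
    """Return True if a str or bytes value holds only ASCII data."""
    try:
        if isinstance(value, bytes):
            value.decode('ascii')   # bytes: every byte must be below 0x80
        else:
            value.encode('ascii')   # str: every code point must be below 128
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    return True


print(fits_in_ascii('hello'))        # True
print(fits_in_ascii('h\xe9llo'))     # False
print(fits_in_ascii(b'hello'))       # True
print(fits_in_ascii(b'h\xc3\xa9'))   # False
```

On Python 3.7 and later, str.isascii() and bytes.isascii() give the same answer without the exception handling.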