
Question:

I need to replace all non-ASCII characters (anything outside \x00-\x7F) with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters:

def remove_non_ascii_1(text):
    return ''.join(i for i in text if ord(i) < 128)

And this one replaces each non-ASCII character with as many spaces as there are bytes in the character's encoding (i.e. the – character is replaced with 3 spaces):

import re

def remove_non_ascii_2(text):
    return re.sub(r'[^\x00-\x7F]', ' ', text)

How can I replace all non-ASCII characters with a single space?

Of the myriad similar SO questions, none addresses replacing characters as opposed to stripping them, and none covers all non-ASCII characters rather than one specific character.

Answer 1:

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+', ' ', text)

Note the + there.
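To make the difference between the two variants concrete, here is a small sketch. The function names and the sample string are made up for illustration and are not part of the original question.

```python
import re

def replace_each(text):
    """Replace every non-ASCII character with its own space."""
    return ''.join(c if ord(c) < 128 else ' ' for c in text)

def replace_runs(text):
    """Replace each run of consecutive non-ASCII characters with one space."""
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

sample = 'ABC\u9a6c\u514bdef'      # 'ABC马克def': two adjacent non-ASCII characters
print(repr(replace_each(sample)))  # 'ABC  def' (two spaces)
print(repr(replace_runs(sample)))  # 'ABC def'  (one space)
```

Which one you want depends on whether "a single space" means one per character or one per run of non-ASCII characters.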

Answer 2:

To get the closest ASCII representation of your original string, I recommend the unidecode module:

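from unidecode import unidecode

def remove_non_ascii(text):
    # Python 2 code: text is a UTF-8 byte string, so decode it to unicode first.
    # In Python 3, where str is already Unicode, call unidecode(text) directly.
    return unidecode(unicode(text, encoding="utf-8"))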

Then you can call it on a string:

remove_non_ascii("Ceñía")
Cenia

Answer 3:

For character processing, use Unicode strings:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

But note you will still have a problem if your string contains decomposed Unicode characters (separate character and combining accent marks, for example):

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'
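One possible way around the decomposed-character problem, sketched here under the assumption that composing the text to NFC first is acceptable for your data, is to normalize before substituting. The function name replace_non_ascii_nfc is hypothetical. Note that NFC only helps when a precomposed form of the character exists; other combining sequences would still leave their base letter behind.

```python
import re
import unicodedata

def replace_non_ascii_nfc(text):
    """Compose to NFC, then replace each non-ASCII code point with a space."""
    composed = unicodedata.normalize('NFC', text)
    return re.sub(r'[^\x00-\x7F]', ' ', composed)

# Build the decomposed form explicitly: 'n' followed by U+0303 COMBINING TILDE.
decomposed = unicodedata.normalize('NFD', 'ma\u00f1ana')
print(len(decomposed))                          # 7
print(repr(replace_non_ascii_nfc(decomposed)))  # 'ma ana'
```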

Answer 4:

If the replacement character can be '?' instead of a space, then I'd suggest result = text.encode('ascii', 'replace').decode():

"""Test the performance of different non-ASCII replacement methods."""


import re
from timeit import timeit


# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
text = 'Æ' * 10_000


print(timeit(
    """
result = ''.join([c if ord(c) < 128 else '?' for c in text])
    """,
    number=1000,
    globals=globals(),
))

print(timeit(
    """
result = text.encode('ascii', 'replace').decode()
    """,
    number=1000,
    globals=globals(),
))

Results:

0.7208260721400134
0.009975979187503592
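If a space is required rather than '?', one option that stays on the encode path is a custom codec error handler. The sketch below uses the standard library's codecs.register_error; the handler name 'space_replace' and the helper function are made up for this example, and its speed has not been measured against the numbers above.

```python
import codecs

def _space_replace(exc):
    """Error handler: substitute one space per unencodable character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.start:exc.end marks the unencodable span; resume encoding at exc.end.
    return (' ' * (exc.end - exc.start), exc.end)

# Register under a hypothetical name so encode() can look it up.
codecs.register_error('space_replace', _space_replace)

text = 'ABC\u9a6c\u514bdef?'
result = text.encode('ascii', 'space_replace').decode('ascii')
print(repr(result))  # 'ABC  def?'
```

Unlike post-processing the '?' output of the built-in 'replace' handler, this leaves any question marks already present in the text untouched.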

Answer 5:

What about this one?

def replace_trash(unicode_string):
    chars = []
    for ch in unicode_string:
        try:
            ch.encode("ascii")
            chars.append(ch)
        except UnicodeEncodeError:
            # Non-ASCII character: replace it with a single space.
            chars.append(" ")
    return "".join(chars)

Answer 6:

As a native and efficient approach, you don't need to use ord or any loop over the characters. Just encode with ascii and ignore the errors.

The following will just remove the non-ascii characters:

new_string = old_string.encode('ascii', errors='ignore')

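Now if you want to make up for the deleted characters, pad the result with one space per removed character (note that the padding is appended at the end rather than at the original positions, and the result is a bytes object):

final_string = new_string + b' ' * (len(old_string) - len(new_string))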
