Remove accented characters from string - Python

Posted 2020-05-01 08:47

I get some data from a webpage and read it like this in Python:

origional_doc = urllib2.urlopen(url).read()

Sometimes the page at this URL contains characters such as é and ä. How can I remove these characters from the string? Right now this is what I am trying:

import unicodedata
origional_doc = ''.join((c for c in unicodedata.normalize('NFD', origional_doc) if unicodedata.category(c) != 'Mn'))

But I get an error:

TypeError: must be unicode, not str

Tags: python unicode

2 answers

Root（大扎）

#2 · 2020-05-01 09:17

Using re, you can remove every character that falls in a given hexadecimal range, here the non-ASCII bytes 0x80 to 0xFF:

>>> import re
>>> re.sub('[\x80-\xFF]','','é and ä and ect')
' and  and ect'

You can also do the inverse and remove anything that's NOT in the basic 128 ASCII characters:

>>> re.sub('[^\x00-\x7F]','','é and ä and ect')
' and  and ect'
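
The session above relies on Python 2 byte strings. As a minimal sketch of the same idea for Python 3, where bytes and text are separate types, something like the following might work. The helper name `drop_non_ascii` is hypothetical and not part of the original answer.

```python
import re

def drop_non_ascii(data):
    """Remove every non-ASCII byte (bytes input) or character (str input)."""
    if isinstance(data, bytes):
        # A bytes pattern is required when the input is bytes.
        return re.sub(rb"[^\x00-\x7F]", b"", data)
    return re.sub(r"[^\x00-\x7F]", "", data)

print(drop_non_ascii("é and ä and ect"))                   # text input
print(drop_non_ascii("é and ä and ect".encode("utf-8")))   # UTF-8 bytes input
```

Note that this deletes the accented letters entirely rather than replacing é with e.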


一夜七次

#3 · 2020-05-01 09:28

This should work. It will eliminate all characters that are not ASCII.

    original_doc = (original_doc.decode('unicode_escape').encode('ascii','ignore'))
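
If the goal is to keep the base letter (é becomes e) rather than drop it, the approach from the question can be made to work by decoding the downloaded bytes to text before normalizing, since the TypeError comes from passing a byte string to `unicodedata.normalize`. A minimal sketch in Python 3, assuming the page is UTF-8 encoded (the real encoding may differ) and using a hypothetical helper name `strip_accents`:

```python
import unicodedata

def strip_accents(raw, encoding="utf-8"):
    """Return text with combining accent marks removed.

    `encoding` is an assumption; use the page's actual encoding.
    """
    # normalize() only accepts text, so decode bytes first.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    # NFD splits each accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Category 'Mn' (nonspacing mark) covers the combining accents.
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(strip_accents("é and ä and ect".encode("utf-8")))
```

In Python 2, the equivalent step would be calling `.decode(...)` on the result of `read()` before passing it to `unicodedata.normalize`.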
