How to make the python interpreter correctly handl-第2页回答

I have a string that looks like so:

6Â 918Â 417Â 712

The clear cut way to trim this string (as I understand Python) is simply to say the string is in a variable called s, we get:

s.replace('Â ', '')

That should do the trick. But of course it complains that the non-ASCII character '\xc2' in file blabla.py is not encoded.

I never quite could understand how to switch between different encodings.

Here's the code, it really is just the same as above, but now it's in context. The file is saved as UTF-8 in notepad and has the following header:

#!/usr/bin/python2.4
# -*- coding: utf-8 -*-

The code:

f = urllib.urlopen(url)

soup = BeautifulSoup(f)

s = soup.find('div', {'id':'main_count'})

#making a print 's' here goes well. it shows 6Â 918Â 417Â 712

s.replace('Â ','')

save_main_count(s)

It gets no further than s.replace...

标签： python unicode

11条回答

Summer. ? 凉城

2楼-- · 2019-01-02 22:44

Way too late for an answer, but the original string was in UTF-8 and '\xc2\xa0' is UTF-8 for NO-BREAK SPACE. Simply decode the original string as s.decode('utf-8') (\xa0 displays as a space when decoded incorrectly as Windows-1252 or latin-1:

Example (Python 3)

s = b'6\xc2\xa0918\xc2\xa0417\xc2\xa0712'
print(s.decode('latin-1')) # incorrectly decoded
u = s.decode('utf8') # correctly decoded
print(u)
print(u.replace('\N{NO-BREAK SPACE}','_'))
print(u.replace('\xa0','-')) # \xa0 is Unicode for NO-BREAK SPACE

Output

6Â 918Â 417Â 712
6 918 417 712
6_918_417_712
6-918-417-712

0人赞添加讨论(0) 举报

一夜七次

3楼-- · 2019-01-02 22:45

The following code will replace all non ASCII characters with question marks.

"".join([x if ord(x) < 128 else '?' for x in s])

0人赞添加讨论(0) 举报

做自己的国王

4楼-- · 2019-01-02 22:53

def removeNonAscii(s): return "".join(filter(lambda x: ord(x)<128, s))

edit: my first impulse is always to use a filter, but the generator expression is more memory efficient (and shorter)...

def removeNonAscii(s): return "".join(i for i in s if ord(i)<128)

Keep in mind that this is guaranteed to work with UTF-8 encoding (because all bytes in multi-byte characters have the highest bit set to 1).

0人赞添加讨论(0) 举报

地球回转人心会变

5楼-- · 2019-01-02 22:54

This is a dirty hack, but may work.

s2 = ""
for i in s:
    if ord(i) < 128:
        s2 += i

0人赞添加讨论(0) 举报

叼着烟拽天下

6楼-- · 2019-01-02 22:57

>>> unicode_string = u"hello aåbäcö"
>>> unicode_string.encode("ascii", "ignore")
'hello abc'

0人赞添加讨论(0) 举报

上一页 1 2

How to make the python interpreter correctly handl

Example (Python 3)

Output

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间