Unicode error handling with Python 3's readlin

Posted 2019-03-09 11:06

I keep getting this error while reading a text file. Is it possible to handle/ignore it and proceed?

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7827: character maps to <undefined>

3 answers
我只想做你的唯一
Reply #2 · 2019-03-09 11:10

In Python 3, pass an appropriate errors= value (such as errors='ignore' or errors='replace') when creating your file object. This presumes it is an io.TextIOWrapper or a subclass of one; if it isn't, consider wrapping it in one. Also consider passing a more likely encoding than the charmap codec reported in the error; when you aren't sure, utf-8 is a good place to start.

For instance:

f = open('misc-notes.txt', encoding='utf-8', errors='ignore')
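
As a rough illustration of how the two handlers differ, the sketch below writes a few bytes, including a stray 0x81, to a temporary file and reads them back both ways. The sample bytes and variable names are made up for the demo.

```python
import os
import tempfile

# Hypothetical sample data: ASCII text with one byte (0x81) that is not valid UTF-8.
raw = b"caf\x81 menu\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # errors='ignore' silently drops the undecodable byte.
    with open(path, encoding="utf-8", errors="ignore") as f:
        ignored = f.read()

    # errors='replace' substitutes U+FFFD (the replacement character) for it.
    with open(path, encoding="utf-8", errors="replace") as f:
        replaced = f.read()
finally:
    os.remove(path)

print(repr(ignored))
print(repr(replaced))
```

With 'ignore' the bad byte vanishes without a trace; with 'replace' its position stays visible in the text, which can make later debugging easier.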

In Python 2, the read() operation simply returns bytes; the trick, then, is decoding them to get them into a string (if you do, in fact, want characters as opposed to bytes). If you don't have a better guess for their real encoding:

your_string.decode('utf-8', 'replace')

...to replace unhandled characters, or

your_string.decode('utf-8', 'ignore')

to simply ignore them.
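
The same two handlers are also accepted by bytes.decode() in Python 3. A minimal sketch, using a made-up byte string in place of your_string:

```python
# Hypothetical input: bytes with one invalid UTF-8 byte (0x81) in the middle.
your_bytes = b"abc\x81def"

replaced = your_bytes.decode("utf-8", "replace")  # bad byte becomes U+FFFD
ignored = your_bytes.decode("utf-8", "ignore")    # bad byte is dropped

print(repr(replaced))
print(repr(ignored))
```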

That said, finding and using their real encoding (rather than guessing utf-8) would be preferred.
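
If the real encoding is unknown, one possible approach is to try a short list of candidates in strict mode and keep the first that decodes without error. This is only a sketch: the function name and the candidate list are assumptions, and a clean decode does not prove the guess is right (latin-1, for example, accepts any byte sequence).

```python
def decode_with_candidates(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate does not fit the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings fit the data")


# Hypothetical sample: text with accented letters, encoded as cp1252.
sample = "na\u00efve caf\u00e9".encode("cp1252")
text, used = decode_with_candidates(sample)
print(used, repr(text))
```

Ordering matters here: stricter encodings such as utf-8 should come first, and a catch-all such as latin-1 last.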

虎瘦雄心在
Reply #3 · 2019-03-09 11:16

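Yes, you could wrap the read in a try/except that catches UnicodeDecodeError (the exception raised when bytes cannot be decoded):

try:
    ...  # read from the file here
except UnicodeDecodeError:
    pass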
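
One way to make this concrete, as a sketch rather than a recommendation: read in binary mode and decode each line inside the try block, so that a bad line is skipped without stopping the loop. The function name and the choice to skip whole lines are assumptions for illustration, and splitting on newlines like this only suits ASCII-compatible encodings such as utf-8.

```python
import io


def read_decodable_lines(binary_stream, encoding="utf-8"):
    """Yield decoded lines, skipping any line that fails to decode."""
    for raw_line in binary_stream:
        try:
            yield raw_line.decode(encoding)
        except UnicodeDecodeError:
            # Skip the line that cannot be decoded and carry on with the next.
            continue


# Stand-in for open(path, 'rb'): an in-memory stream with one bad line.
stream = io.BytesIO(b"first\nbad \x81 line\nlast\n")
lines = list(read_decodable_lines(stream))
print(lines)
```

Skipping silently loses data, so logging the skipped line numbers may be worth adding in real use.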
ゆ 、 Hurt°
Reply #4 · 2019-03-09 11:30

You should open the file with the codecs module to make sure that the file gets interpreted as UTF-8.

import codecs
fd = codecs.open(filename, 'r', encoding='utf-8')
data = fd.read()