I am working with external data that's encoded in latin1. So I've add sitecustomize.py
and in it added
sys.setdefaultencoding('latin_1')
sure enough, now working with latin1 strings works fine.
But, in case I encounter something that is not encoded in latin1:
s=str(u'abc\u2013')
I get UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)
What I would like is that the undecodable chars would simply be ignored, i.e I would get that in the above example s=='abc?'
, and do that without explicitly calling decode()
or encode
each time, i.e not s.decode(...,'replace') on each call.
I tried doing different things with codecs.register_error
but to no avail.
please help?
There is a reason scripts can't call sys.setdefaultencoding. Don't do that, some libraries (including standard libraries included with Python) expect the default to be 'ascii'.
Instead, explicitly decode strings to Unicode when read into your program (via file, stdin, socket, etc.) and explicitly encode strings when writing them out.
Explicit decoding takes a parameter specifying behavior for undecodable bytes.
You can define your own custom handler and use it instead to do as you please. See this example:
import codecs
from logging import getLogger
log = getLogger()
def custom_character_handler(exception):
log.error("%s for %s on %s from position %s to %s. Using '?' in-place of it!",
exception.reason,
exception.object[exception.start:exception.end],
exception.encoding,
exception.start,
exception.end )
return ("?", exception.end)
codecs.register_error("custom_character_handler", custom_character_handler)
print( b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'.decode('utf8', 'custom_character_handler') )
print( codecs.encode(u"abc\u03c0de", "ascii", "custom_character_handler") )
Running it, you will see:
invalid start byte for b'\xbb' on utf-8 from position 5 to 6. Using '?' in-place of it!
Føö?Bår
ordinal not in range(128) for π on ascii from position 3 to 4. Using '?' in-place of it!
b'abc?de'
References:
- https://docs.python.org/3/library/codecs.html#codecs.register_error
- https://docs.python.org/3/library/exceptions.html#UnicodeError
- How to ignore invalid lines in a file?
- 'str' object has no attribute 'decode'. Python 3 error?
- How to replace invalid unicode characters in a string in Python?
- UnicodeDecodeError in Python when reading a file, how to ignore the error and jump to the next line?