UnicodeDecodeError while processing accented words

Posted 2019-08-28 02:23

Question:

I have a Python script that reads a YAML file (it runs on an embedded system). Without accents, the script runs normally both on my development machine and on the embedded system. With accented words, however, it crashes with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)

only in the embedded environment.

The YAML sample:

data: ã

The snippet which reads the YAML:

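with open(YAML_FILE, 'r') as stream:  # no encoding given: the platform default is used
  try:
    data = yaml.load(stream)
  except yaml.YAMLError as exc:  # handler added to complete the truncated snippet
    print(exc)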

Tried a bunch of solutions without success.

Versions: Python 3.6, PyYAML 3.12

Answer 1:

The codec that is reading your bytes has been set to ASCII. This restricts you to byte values between 0 and 127.

In UTF-8, accented characters are encoded with bytes outside this range (ã, for example, is the two bytes 0xC3 0xA3), so you get a decoding error.

A UTF-8 codec decodes ASCII as well as UTF-8, because ASCII is a (very small) subset of UTF-8, by design.
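The error from the question can be reproduced without PyYAML or any file at all. The following is a minimal sketch that uses only built-in string methods; the variable names are illustrative. It encodes the sample line as UTF-8 and then decodes it with each codec:

```python
# The sample document, encoded as UTF-8 the way it is stored on disk.
raw = "data: ã".encode("utf-8")
print(raw)  # b'data: \xc3\xa3'

# An ASCII decoder rejects the first byte above 127.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# A UTF-8 decoder handles the ASCII bytes and the accented character alike.
text = raw.decode("utf-8")
print(ascii(text))  # ascii() keeps the output printable on an ASCII-only terminal
```

The six ASCII bytes of "data: " occupy positions 0 to 5, which is why the traceback reports byte 0xc3 at position 6.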

If you can switch the decoder to UTF-8, it should work.

In general, you should always state explicitly how a byte stream is to be decoded to text; otherwise its interpretation is ambiguous.
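One likely reason the crash shows up only on the embedded system (an assumption, since the question does not show its locale settings) is that open() in text mode falls back to the locale's preferred encoding, which under a C/POSIX locale on Python 3.6 is typically ASCII. A small sketch using only the standard library, with a throwaway temporary file standing in for the real YAML file:

```python
import locale
import os
import tempfile

# Encoding that open() uses in text mode when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Write the sample document as UTF-8 bytes to a temporary file (illustrative only).
with tempfile.NamedTemporaryFile(suffix=".yaml", delete=False) as tmp:
    tmp.write("data: ã\n".encode("utf-8"))
    path = tmp.name

try:
    # An explicit encoding makes the result independent of the locale.
    with open(path, encoding="utf-8") as stream:
        text = stream.read()
finally:
    os.remove(path)

print(ascii(text))  # ascii() keeps the output printable on an ASCII-only terminal
```

Running the first two statements on both machines should show whether the defaults differ.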



Answer 2:

You can specify the codec to use when dumping data with PyYAML, but there is no way to specify it when you load. PyYAML does accept Unicode text as input, however, and you can state the codec explicitly when opening the file for reading; that codec is then used to decode the text the stream returns (you are opening the file in text mode with 'r', which is the default for open()).

import yaml

YAML_FILE = 'input.yaml'

with open(YAML_FILE, encoding='utf-8') as stream:
    data = yaml.safe_load(stream)

Please note that you should almost never have to use yaml.load(), which is documented to be unsafe; use yaml.safe_load() instead.

To dump the data in the same format you loaded it, use:

import sys
yaml.safe_dump(data, sys.stdout, allow_unicode=True, encoding='utf-8',
               default_flow_style=False)

The default_flow_style=False is needed to avoid the flow-style curly braces, and allow_unicode=True is necessary because otherwise you get data: "\xE3" (i.e. escape sequences for non-ASCII characters).
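The effect of allow_unicode can be checked by dumping to a string instead of sys.stdout. This is a small sketch, assuming PyYAML is installed and using a hard-coded dictionary in place of the loaded file:

```python
import yaml

data = {'data': 'ã'}  # stand-in for the document loaded above

# With no stream argument, safe_dump returns the YAML text as a str.
escaped = yaml.safe_dump(data, default_flow_style=False)
readable = yaml.safe_dump(data, default_flow_style=False, allow_unicode=True)

print(ascii(escaped))   # shows the escape sequence written for the accented character
print(ascii(readable))  # shows the accented character kept as-is

# Both forms load back to the same data.
assert yaml.safe_load(escaped) == yaml.safe_load(readable) == data
```

The two outputs differ only in how the character is written; both are valid YAML for the same value.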