Python Syntax error: non-ASCII [duplicate]

Posted 2019-08-28 07:04

Question:

This question already has an answer here:

  • SyntaxError: Non-ASCII character '\xa3' in file when function returns '£' 4 answers
  • Correct way to define Python source code encoding 6 answers

I keep getting an error and I'm not sure how to fix it.

The code:

if not len(lines) or lines[-1] == '' or lines[-1] == '▁':
    lines = list(filter(lambda line: False if line == '' or line == '▁' else True, list(lines)))

Output: SyntaxError: Non-ASCII character '\xe2' in file prepare_data.py on line 512, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Answer 1:

The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the bytes in the string which displays as a funky underscore.

If you want to match U+2581 then you can say

.... or lines[-1] == '\u2581':

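which represents this character in pure ASCII by way of a Unicode escape sequence. (On Python 2, which is where this particular error message comes from, \u escapes are only interpreted in unicode literals, so write u'\u2581' there.) If you want to match a regular ASCII underscore, that's ASCII 95 / U+005F; here are the two characters side by side for easy comparison and possible copy/paste: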

U+2581 ▁  _ U+005F
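
To check which of the two characters a string actually contains, it can help to print the code point and the official Unicode name. This is only a minimal sketch for Python 3; the helper name describe is made up for illustration.

```python
import unicodedata


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return "U+{:04X} {}".format(ord(ch), unicodedata.name(ch))


print(describe("\u2581"))  # U+2581 LOWER ONE EIGHTH BLOCK
print(describe("_"))       # U+005F LOW LINE
```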

The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be

# coding=utf-8

or the Emacs-compatible

# -*- coding: utf-8 -*-
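
If you are unsure whether a declaration is being picked up, the standard library can report what the tokenizer sees. The following is a rough sketch for Python 3 using tokenize.detect_encoding; the helper name declared_encoding and the sample sources are assumptions for illustration. Python 3 falls back to UTF-8 when no declaration is present, whereas Python 2 assumes ASCII, which is why the error in the question appears there.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python's tokenizer detects for these bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Both declaration styles from PEP 263 are recognised.
print(declared_encoding(b"# coding=utf-8\nx = 1\n"))
print(declared_encoding(b"# -*- coding: cp1252 -*-\nx = 1\n"))
# With no declaration, Python 3 reports its default.
print(declared_encoding(b"x = 1\n"))
```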

If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.

In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#e2 shows 21 possible interpretations for the byte 0xE2 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. In fact, I would guess you are actually using UTF-8, which represents this character as the three bytes 0xE2 0x96 0x81; but without also seeing the character rendered as something resembling an underscore, there would be absolutely no way to guess this for a human, either.
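
The ambiguity is easy to see by decoding the same bytes under a few different encodings. This is a small Python 3 sketch; the choice of codecs is arbitrary and only meant to illustrate the point, not to reproduce the list on the linked page.

```python
# The three bytes that UTF-8 uses for U+2581.
raw = "\u2581".encode("utf-8")
print(raw)  # b'\xe2\x96\x81'

# The first byte alone, read as a few legacy 8-bit encodings (arbitrary sample).
first_byte = raw[:1]
decoded = {
    codec: first_byte.decode(codec)
    for codec in ("latin-1", "cp1252", "cp437", "koi8-r", "iso8859-7")
}
for codec, text in decoded.items():
    print("{:10} U+{:04X} {}".format(codec, ord(text), text))

# Only with the right encoding do the three bytes come back as one character.
print(raw.decode("utf-8") == "\u2581")
```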



Answer 2:

Try this. I haven't tested it, but I think it might solve your encoding problem. Your code also needs some readability improvements; please remember the Zen of Python.

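# coding=utf-8
# The declaration above is still required: this file contains a non-ASCII literal.

def filter_line(line):
    # Keep a line only if it is neither empty nor the lone U+2581 character.
    return bool(line) and line != '▁'

# Python 2 only, assuming lines holds unicode objects: encode them to UTF-8
# byte strings so that they compare equal to the byte-string literal '▁'.
# On Python 3, drop this line; it would turn the strings into bytes.
lines = [line.encode("utf-8") for line in lines]

if not lines or lines[-1] == '' or lines[-1] == '▁':
    lines = list(filter(filter_line, lines))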
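
For comparison, here is a hedged Python 3 sketch that combines the escape sequence from the first answer with the refactoring idea above, so the source file stays pure ASCII and needs no encoding declaration. The names BLOCK, keep_line and drop_placeholder_lines are invented for this example and do not come from the original code.

```python
BLOCK = "\u2581"  # U+2581 written as an escape, so this file stays pure ASCII


def keep_line(line):
    """Return True for lines that are neither empty nor the lone U+2581."""
    return line not in ("", BLOCK)


def drop_placeholder_lines(lines):
    """Filter out empty/U+2581 lines, but only when the list is empty or
    its last entry is one of them (the same condition as in the question)."""
    lines = list(lines)
    if not lines or lines[-1] in ("", BLOCK):
        lines = [line for line in lines if keep_line(line)]
    return lines


print(drop_placeholder_lines(["a", "", "\u2581"]))  # ['a']
```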