This question already has an answer here:
- SyntaxError: Non-ASCII character '\xa3' in file when function returns '£'
- Correct way to define Python source code encoding
I keep getting an error and I'm not sure how to fix it.
The code:
if not len(lines) or lines[-1] == '' or lines[-1] == '▁':
    lines = list(filter(lambda line: False if line == '' or line == '▁' else True, list(lines)))
Output:
SyntaxError: Non-ASCII character '\xe2' in file prepare_data.py on line 512, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the bytes in the string which displays as a funky underscore.
If you want to match U+2581 then you can say
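.... or lines[-1] == u'\u2581':
which represents this character in pure ASCII by way of a Unicode escape sequence. (In Python 2, which is what produces this error message, the u prefix is required for the \u escape to be interpreted, and lines then has to contain Unicode strings rather than byte strings.) If you want to match a regular ASCII underscore, that's ASCII 95 / U+005F; here are the two characters side by side for easy comparison and possible copy/paste: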
U+2581 ▁ _ U+005F
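If it is unclear which of the two characters ended up in your source, a quick check like the following can tell them apart. This is only a sketch and assumes Python 3, where every string literal is Unicode; the variable names are made up for illustration.

```python
import unicodedata

block = "\u2581"    # the character from the question, written as an ASCII-only escape
underscore = "_"    # the ordinary ASCII underscore, U+005F

for ch in (block, underscore):
    # Print the code point, the official Unicode name, and whether it is ASCII.
    print("U+%04X" % ord(ch), unicodedata.name(ch), ord(ch) < 128)
```

Only the second character is in the ASCII range, which is why only the first one triggers the error.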
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- coding: utf-8 -*-
Either way, the comment has to be on the first or second line of the file.
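To see which declaration Python actually picks up, the standard library's tokenize.detect_encoding can be pointed at the raw bytes of a file. The sketch below assumes Python 3, which falls back to UTF-8 when no declaration is found (Python 2 falls back to ASCII, hence the error in the question); the helper name declared_encoding and the sample sources are hypothetical.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to read these source bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

with_equals = b"# coding=utf-8\nx = 1\n"
emacs_style = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nx = 1\n"
too_late = b"x = 1\ny = 2\n# coding=latin-1\n"   # declaration on line 3 is ignored

for sample in (with_equals, emacs_style, too_late):
    print(declared_encoding(sample))
```

The third sample shows why the position matters: a declaration after the second line is treated as an ordinary comment.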
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#e2 shows 21 possible interpretations for the byte 0xE2 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. In fact, I would guess you are actually using UTF-8, which represents this character as the three bytes 0xE2 0x96 0x81; but without also seeing the character rendered as something resembling an underscore, there would be absolutely no way to guess this for a human, either.
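The ambiguity is easy to reproduce. The following sketch, assuming Python 3, decodes the single byte 0xE2 with a handful of legacy 8-bit codecs (an arbitrary selection, not the full list from the linked page) and then shows the three-byte UTF-8 form of U+2581.

```python
import unicodedata

lone_byte = b"\xe2"
candidates = ["latin-1", "cp1252", "cp437", "koi8-r", "iso-8859-5", "iso-8859-7"]

# The same byte decodes to a different character under each legacy encoding.
readings = {name: lone_byte.decode(name) for name in candidates}
for name, text in readings.items():
    print("%-10s U+%04X %s" % (name, ord(text), unicodedata.name(text)))

# In UTF-8, 0xE2 is only the first of the three bytes that encode U+2581.
utf8_bytes = "\u2581".encode("utf-8")
print(utf8_bytes)

# On its own, the byte is not valid UTF-8 at all.
try:
    lone_byte.decode("utf-8")
    lone_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_is_valid_utf8 = False
print(lone_is_valid_utf8)
```

Nothing in the byte itself says which of these readings is intended, which is why the interpreter insists on an explicit declaration.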
Try this. I haven't tested it, but I think it might solve your encoding problem, provided the encoding declaration is on the first or second line of the file. Your code also needs some improvements for readability; remember the Zen of Python, please.
# -*- coding: utf-8 -*-
def filter_line(line):
    # Keep a line only if it is neither empty nor the U+2581 marker.
    return bool(line) and line != '▁'

if not lines or lines[-1] == '' or lines[-1] == '▁':
    lines = list(filter(filter_line, lines))