Processing non-English text

Posted 2019-09-25 03:48

Question:

I have a Python file that reads a file given by the user, processes it, and asks questions in flash card format. The program works fine with an English txt file, but I encounter errors when trying to process a French file.

When I first encountered the error, I was using the Windows command prompt and running python cards.py. When I gave it the French file, I immediately got a UnicodeEncodeError. After digging around, I found that it may have something to do with the fact that I was using the cmd window, so I tried IDLE. I didn't get any errors, but I would get weird characters like œ and à and ®.

Upon further research, I found some documentation that says to pass encoding='insert encoding type' in the open(file) part of my code. After running the program again in IDLE, it seemed to reduce the problem, but I would still get some weird characters. When running it in cmd, it wouldn't break IMMEDIATELY, but it eventually would when it encountered an unknown character.

My question: what do I implement to ensure the program can handle ALL of the characters in the file (given any language), and why do IDLE and the command prompt handle the file differently?

EDIT: I forgot to mention that I ended up using utf-8 which gave the results I described.

Answer 1:

It's a common question. It seems you're using cmd, which doesn't support Unicode, so the error occurs when the output is translated to the encoding that cmd uses. Because Unicode covers a wider character set than that encoding, any character outside it raises an error.
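The failure described above can be reproduced without a console at all. The following is a minimal sketch, assuming the console uses code page 437 (a common legacy default for cmd; the asker's actual code page is not stated). The helper name try_encode is made up for illustration.

```python
# Assumption: the console encoding is cp437; the real one may differ.
text = "œuvre"  # 'œ' (U+0153) has no slot in code page 437


def try_encode(s, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent s."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError:
        # This is the same exception print() raises when the console
        # encoding cannot represent a character.
        return None


print(try_encode(text, "cp437"))  # None: cp437 has no 'œ'
print(try_encode(text, "utf-8"))  # b'\xc5\x93uvre'
```

UTF-8 can encode every Unicode character, which is why the same string succeeds there.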

IDLE is built on top of tkinter's Text widget, which fully supports Python's Unicode strings.

Finally, when you open a file without specifying an encoding, the open function assumes it is in the platform default encoding (per locale.getpreferredencoding()). So if your file's encoding differs, you should state it explicitly with the encoding keyword argument to open.
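As a rough illustration of why the encoding argument matters, the sketch below writes a small UTF-8 file and reads it back twice: once as UTF-8 and once as cp1252 (assumed here as a typical Windows default; yours may differ). The file name cards_fr.txt and the sample sentence are hypothetical.

```python
import locale
import os
import tempfile

sample = "Où est la bibliothèque ?\n"  # hypothetical flash-card line

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cards_fr.txt")  # hypothetical file name

    # Write the file as UTF-8, as a French text file typically would be.
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Reading with the matching encoding round-trips the text.
    with open(path, encoding="utf-8") as f:
        correct = f.read()

    # Reading with the wrong encoding does not fail here, but each
    # accented letter turns into two unrelated characters.
    with open(path, encoding="cp1252") as f:
        garbled = f.read()

print(locale.getpreferredencoding())  # what open() uses when encoding is omitted
print(correct)
print(garbled)
```

The garbled result contains sequences starting with Ã, the same kind of stray characters described in the question.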



Answer 2:

The Windows console does not natively support Unicode (despite what people say about chcp 65001). It's designed to be backwards compatible, so it only supports 8-bit character sets.

Use win-unicode-console instead. It talks to the console at a lower level, which allows all Unicode characters to be printed and, importantly, entered.

The best way to enable it is in your usercustomize script, so that it's enabled by default on your machine.
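A usercustomize.py along these lines is one way to do that. This is only a sketch: it assumes the win-unicode-console package is installed (pip install win-unicode-console) and that the file sits in your user site-packages directory (python -m site --user-site prints the path). The wrapper function enable_unicode_console is a made-up name; win_unicode_console.enable() is the package's documented entry point. Note that since Python 3.6 the interpreter talks to the Windows console in Unicode by itself (PEP 528), so this may be unnecessary on newer versions.

```python
# usercustomize.py (imported automatically at interpreter startup)
import sys


def enable_unicode_console():
    """Try to enable win-unicode-console; return True only if it was enabled."""
    if sys.platform != "win32":
        # The package only makes sense on Windows.
        return False
    try:
        import win_unicode_console  # third-party; assumed to be installed
    except ImportError:
        # Fail quietly so Python still starts when the package is missing.
        return False
    win_unicode_console.enable()
    return True


enabled = enable_unicode_console()
```

Guarding the import keeps a missing package from printing an error every time Python starts.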