UnicodeEncodeError writing text with special chara

Posted 2019-09-05 13:34

Question:

I get a UnicodeEncodeError writing text with a special character to a file:

  File "D:\SOFT\Python3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 956: character maps to <undefined>

My code:

expFile = open(expFilePath, 'w')
# data var is what contains a special char
expFile.write("\n\n" + data)

The data probably contains some odd character from something like Microsoft Word that was pasted into the application's HTML form and persisted; now I am importing it. I can't even see the character: it shows as a diamond in my DB editor when I query it, and as a placeholder in the text editor. The input should be checked more rigorously for character set compliance, but it is not.

Is there a way to encode the data to make any character digestible for I/O processing?

Alternatively, is there a way to check whether my str complies with the character set expected by file I/O, so that I can replace any data that violates it?

Answer 1:

Your problem is that opening the file in text mode on your Windows system defaulted to the locale code page, cp1252, a single-byte ASCII superset that can encode only a tiny fraction of the Unicode range.
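To see the limitation concretely, here is a minimal sketch that lists the characters in a string that a given codec cannot represent. The helper name unencodable_chars and the sample string are made up for illustration; only str.encode and its errors argument come from the standard library.

```python
def unencodable_chars(text, encoding="cp1252"):
    """Return (index, character) pairs that `encoding` cannot represent.

    Hypothetical helper for illustration, not part of any library.
    """
    bad = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            bad.append((index, char))
    return bad


# U+00E9 (e with acute accent) exists in cp1252; U+FFFD (the replacement
# character, often drawn as a diamond) does not.
sample = "caf\u00e9 \ufffd ok"
problems = unencodable_chars(sample)

# If the target encoding cannot change, errors="replace" substitutes "?"
# for every character the codec cannot encode instead of raising.
lossy_bytes = sample.encode("cp1252", errors="replace")
```

Replacing characters this way loses information, so it is only worth considering when the output really must stay in the legacy code page.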

To fix, supply a more comprehensive encoding that can support the whole Unicode range; open accepts a keyword argument to override the default encoding, so it's as simple as changing:

expFile = open(expFilePath, 'w')

to

expFile = open(expFilePath, 'w', encoding='utf-8')
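As a rough sketch of the same fix in context, the snippet below writes a string containing U+FFFD with an explicit encoding and reads it back. The function names, the temporary directory, and the file name export.txt are assumptions for the demonstration; the question's own expFilePath and data would take their place.

```python
import os
import tempfile


def write_export(path, data):
    # The explicit encoding makes the result independent of the locale
    # code page; the with-block closes the file even if write() raises.
    with open(path, "w", encoding="utf-8") as exp_file:
        exp_file.write("\n\n" + data)


def read_export(path):
    # Read back with the same encoding that was used for writing.
    with open(path, "r", encoding="utf-8") as exp_file:
        return exp_file.read()


with tempfile.TemporaryDirectory() as tmp_dir:
    exp_file_path = os.path.join(tmp_dir, "export.txt")  # hypothetical path
    write_export(exp_file_path, "diamond: \ufffd")
    round_tripped = read_export(exp_file_path)
```

Any program that later reads the file has to use the same encoding, otherwise the non-ASCII characters will be decoded incorrectly.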

Depending on your needs, I'd choose either utf-8 or utf-16. The former is more compact for mostly ASCII text and is by far the more widely used; the latter matches Microsoft's typical encoding for storing portable (non-locale-dependent) text, so a few Windows-specific text editors may recognize or handle it more easily.
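The size trade-off is easy to check by encoding sample strings both ways. The strings below are arbitrary examples; note that Python's "utf-16" codec prepends a 2-byte byte order mark.

```python
ascii_text = "hello world"          # 11 ASCII characters
cjk_text = "\u65e5\u672c\u8a9e"     # 3 CJK characters in the Basic Multilingual Plane

# UTF-8 uses 1 byte per ASCII character; UTF-16 uses 2, plus the 2-byte BOM.
ascii_utf8_size = len(ascii_text.encode("utf-8"))
ascii_utf16_size = len(ascii_text.encode("utf-16"))

# For these CJK characters the balance shifts: 3 bytes each in UTF-8,
# 2 bytes each in UTF-16 (again plus the BOM).
cjk_utf8_size = len(cjk_text.encode("utf-8"))
cjk_utf16_size = len(cjk_text.encode("utf-16"))
```

For mostly ASCII data such as the export in the question, UTF-8 output is roughly half the size of UTF-16.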