“surrogateescape” cannot escape certain characters

Posted 2019-06-17 23:39

Question:

On reading and writing text files in Python, one of the main Python contributors says this about the surrogateescape Unicode error handler:

[surrogateescape] handles decoding errors by squirreling the data away in a little used part of the Unicode code point space. When encoding, it translates those hidden away values back into the exact original byte sequence that failed to decode correctly.

However, opening a file with this handler and then writing its contents to another file:

input_file = open('someFile.txt', 'r', encoding="ascii", errors="surrogateescape")
output_file = open('anotherFile.txt', 'w')

for line in input_file:
    output_file.write(line)

Results in:

  File "./break-50000.py", line 37, in main
    output_file.write(line)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 3: surrogates not allowed

Note that the input file is not ASCII. However, the loop traverses hundreds of lines that contain non-ASCII characters just fine before it throws the exception on one particular line. The output file must be ASCII, and losing some characters is just fine.

This is the line that throws the error, shown here decoded as UTF-8:

'Zoë\'s Coffee House'

This is the hex encoding:

$ cat z.txt | hd
00000000  27 5a 6f c3 ab 5c 27 73  20 43 6f 66 66 65 65 20  |'Zo..\'s Coffee |
00000010  48 6f 75 73 65 27 0a                              |House'.|
00000017

Why might the surrogateescape Unicode Error Handler be returning a character that is not ASCII? This is with Python 3.2.3 on Kubuntu Linux 12.10.

Answer 1:

Why might the surrogateescape Unicode Error Handler be returning a character that is not ASCII?

Because that's what it explicitly does. That way you can use the same error handler the other way and it will know what to do.

>>> b"'Zo\xc3\xab\\'s'".decode('ascii', errors='surrogateescape')
"'Zo\udcc3\udcab\\'s'"
>>> "'Zo\udcc3\udcab\\'s'".encode('ascii', errors='surrogateescape')
b"'Zo\xc3\xab\\'s'"


Answer 2:

A lone surrogate should NOT be encoded in UTF-8 -- which is precisely why it was used for the internal representation of invalid input.

In real life, it is pretty common to get data that is invalid for the encoding it is "supposed" to be in. For example, this question was inspired by text that appears to be in Latin-1, when ASCII or UTF-8 was expected. I put "supposed" in quotes, because it is pretty common for the "encoding information" to just be a guess, perhaps unrelated to the actual file.

By default, XML processing (and most Unicode processing) is strict: the entire process gives up even though it could process hundreds of other lines just fine.

Decoding with errors='replace' would turn that line into "Zo?'s Coffee House", which is an improvement. (Well, unless you tried to replace invalid characters with something else that isn't valid either. The official Unicode replacement character isn't valid in ASCII, which is why a '?' is typically used for encoding.)
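As a rough illustration of that parenthetical, the sketch below uses a shortened copy of the problem line (the variable names are made up for the example). Decoding yields U+FFFD for each undecodable byte, and the '?' only appears once the text is encoded back to ASCII with its own errors='replace'.

```python
raw = b"Zo\xc3\xab's Coffee House"  # UTF-8 bytes for "Zoë's Coffee House"

# Decoding as ASCII with errors="replace" substitutes U+FFFD
# (the Unicode replacement character) for each undecodable byte.
decoded = raw.decode("ascii", errors="replace")
print(ascii(decoded))

# U+FFFD is not ASCII either, so encoding needs an error handler too;
# on the encoding side, "replace" emits "?" instead.
encoded = decoded.encode("ascii", errors="replace")
print(encoded)
```

Either way the original bytes are gone after this step, which is the trade-off surrogateescape avoids.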

surrogateescape is used when the programmer decides "You know what? I don't care if the data is garbage. Maybe I have the wrong codec ... so I'll just pass the unknown bytes along as-is." Python does have to store (but avoid interpreting) those bytes internally until they are passed along.

Using unpaired surrogates allows Python to store the invalid bytes without extra escaping. Precisely because unpaired surrogates are invalid, they will never appear in valid input. (And if they occur anyhow, they'll be interpreted as a pair of unrecognized bytes, both of which get preserved for output.)
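A small sketch of that storage scheme, assuming the two bytes from the question (variable names are illustrative): each undecodable byte 0xNN is kept as the lone low surrogate U+DC00 + 0xNN, so the byte value can be read straight back out of the code point.

```python
raw = b"Zo\xc3\xab"

text = raw.decode("ascii", errors="surrogateescape")

# Undecodable bytes (0x80..0xFF) land in the range U+DC80..U+DCFF.
escaped = [ch for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Subtracting 0xDC00 recovers the original byte value.
recovered = bytes(ord(ch) - 0xDC00 for ch in escaped)

print([hex(ord(ch)) for ch in escaped])
print(recovered)
```

In practice there is no need to do the subtraction by hand; encoding with errors='surrogateescape' performs the same reverse mapping.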

The original poster's problem is that they were trying to write out that internal representation directly as UTF-8, instead of reversing the mapping first by encoding with the same error handler. The internal representation contained code points that are (intentionally) not valid, so the default (strict) error handler refused.
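One way to reverse the mapping on output, sketched under the assumption that the goal is a byte-for-byte copy: open the output file with the same codec and error handler as the input. The temporary directory and the newline="" arguments are additions for the sake of a self-contained example and are not part of the original question.

```python
import os
import tempfile

raw = b"'Zo\xc3\xab\\'s Coffee House'\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "someFile.txt")
    dst = os.path.join(tmp, "anotherFile.txt")
    with open(src, "wb") as f:
        f.write(raw)

    # Same codec and error handler on both sides, so the escaped
    # surrogates are turned back into the original bytes on write.
    # newline="" disables newline translation on every platform.
    with open(src, "r", encoding="ascii", errors="surrogateescape",
              newline="") as input_file, \
         open(dst, "w", encoding="ascii", errors="surrogateescape",
              newline="") as output_file:
        for line in input_file:
            output_file.write(line)

    with open(dst, "rb") as f:
        copied = f.read()

print(copied == raw)
```

Note that this passes the non-ASCII bytes through unchanged. If the output really has to be pure ASCII, as the question states, opening the output file with errors="replace" instead should write a '?' for each escaped byte.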



Answer 3:

Why should a low surrogate such as U+DCC3 be encoded in UTF-8? That is not allowed, and it would be useless, because a surrogate is NOT a character. Find the high surrogate that belongs to the low surrogate, decode the pair to its code point, and then create the proper UTF-8 sequence for that code point.