When you have incorrectly decoded characters, how can you identify likely candidates for the original string?
Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png
I know for a fact that this image filename should have been some Japanese characters. But with various guesses at urllib quoting/unquoting, encode and decode iso8859-1, utf8, I haven't been able to unmunge and get the original filename.
Is the corruption reversible?
You could use chardet (install with pip):
import chardet
your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
detected_encoding = chardet.detect(your_str)["encoding"]
try:
correct_str = your_str.decode(detected_encoding)
except UnicodeDecodeError:
print("Could not estimate encoding")
Result: 時間試験観点(アニメパス)_10秒 (no idea if this could be correct or not)
For Python 3 (source file encoded as utf8):
import chardet
import codecs
falsely_decoded_str = "Ä×èÈÄÄî¦è¤ô_üiâAâjâüâpâXüj_10òb"
try:
encoded_str = falsely_decoded_str.encode("cp850")
except UnicodeEncodeError:
print("could not encode falsely decoded string")
encoded_str = None
if encoded_str:
detected_encoding = chardet.detect(encoded_str)["encoding"]
try:
correct_str = encoded_str.decode(detected_encoding)
except UnicodeEncodeError:
print("could not decode encoded_str as %s" % detected_encoding)
with codecs.open("output.txt", "w", "utf-8-sig") as out:
out.write(correct_str)
In summary:
>>> s = 'Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png'
>>> s.encode('cp850').decode('shift-jis')
'時間試験観点(アニメパス)_10秒.png'