I have a JSON document, likely improperly encoded, from a source I do not control. It contains the following strings:
d\u00c3\u00a9cor
business\u00e2\u20ac\u2122 active accounts
the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label
From this, I am gathering they intend for \u00c3\u00a9 to become é, which would be UTF-8 bytes C3 A9. That makes some sense. For the others, I assume we are dealing with some kind of directional quotation marks.
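A quick check in a Python 2 shell seems to bear out at least the é part: misreading the UTF-8 bytes for é as Latin-1 reproduces exactly what I am seeing:

>>> u'é'.encode('utf8')
'\xc3\xa9'
>>> u'é'.encode('utf8').decode('latin-1')
u'\xc3\xa9'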
My theory here is that this is either using some encoding I've never encountered before, or that it has been double-encoded in some way. I am fine with writing some code to transform their broken input into something I can use, as it is highly unlikely they could fix the system even if I brought it to their attention.
Any ideas on how to coerce their input into something I can work with? For the record, I am working in Python.
You should try the ftfy module:
>>> import ftfy
>>> print ftfy.fix_text(u"d\u00c3\u00a9cor")
décor
>>> print ftfy.fix_text(u"business\u00e2\u20ac\u2122 active accounts")
business' active accounts
>>> print ftfy.fix_text(u"the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label")
the "Made in the USA" label
>>> print ftfy.fix_text(u"the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label", uncurl_quotes=False)
the “Made in the USA” label
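ftfy is a third-party library, so you will need to install it first; it is on PyPI:

pip install ftfy

Current releases expose the fixer as ftfy.fix_text(), as used above.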
You have Mojibake data here: UTF-8 bytes that were decoded with the wrong codec before being serialised into the JSON output.
The trick is to figure out which encoding was used for that wrong decode. The first two samples can be repaired if you assume the encoding was Windows Codepage 1252:
>>> sample = u'''\
... d\u00c3\u00a9cor
... business\u00e2\u20ac\u2122 active accounts
... the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label
... '''.splitlines()
>>> print sample[0].encode('cp1252').decode('utf8')
décor
>>> print sample[1].encode('cp1252').decode('utf8')
business’ active accounts
but this codec fails for the 3rd:
>>> print sample[2].encode('cp1252').decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x9d' in position 24: character maps to <undefined>
The first three 'weird' characters are certainly a CP1252 Mojibake of the U+201C LEFT DOUBLE QUOTATION MARK codepoint:
>>> sample[2]
u'the \xe2\u20ac\u0153Made in the USA\xe2\u20ac\x9d label'
>>> sample[2][:22].encode('cp1252').decode('utf8')
u'the \u201cMade in the USA'
so the other combo is presumably meant to be U+201D RIGHT DOUBLE QUOTATION MARK, but the latter character results in a UTF-8 byte not normally present in CP1252:
>>> u'\u201d'.encode('utf8').decode('cp1252')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined>
That's because there is no hex 9D position in the CP1252 codec, but the codepoint did make it into the JSON output:
>>> sample[2][22:]
u'\xe2\u20ac\x9d label'
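For the record, you can enumerate those CP1252 gaps yourself; this quick loop (my own check, not anything ftfy provides) shows 0x9D is one of five byte values the codec leaves undefined:

>>> bad = []
>>> for i in range(256):
...     try:
...         chr(i).decode('cp1252')
...     except UnicodeDecodeError:
...         bad.append(hex(i))
...
>>> bad
['0x81', '0x8d', '0x8f', '0x90', '0x9d']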
The ftfy library Ned Batchelder so helpfully alerted me to uses a 'sloppy' CP1252 codec to work around that issue, mapping the otherwise-undefined byte values one-to-one to the corresponding Latin-1 codepoints (so byte 9D ↔ U+009D). The resulting 'fancy quotes' are then mapped to ASCII quotes by the library, but you can switch that off:
>>> import ftfy
>>> ftfy.fix_text(sample[2])
u'the "Made in the USA" label'
>>> ftfy.fix_text(sample[2], uncurl_quotes=False)
u'the \u201cMade in the USA\u201d label'
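If you ever need to do this without the dependency, here is a minimal sketch of the same trick (the sloppy_cp1252_encode name is mine, not ftfy's):

def sloppy_cp1252_encode(text):
    # Encode character by character: use CP1252 where it is defined,
    # and fall back to Latin-1 for the five codepoints (such as U+009D)
    # that CP1252 leaves unmapped.
    chunks = []
    for char in text:
        try:
            chunks.append(char.encode('cp1252'))
        except UnicodeEncodeError:
            chunks.append(char.encode('latin-1'))
    return ''.join(chunks)

>>> print sloppy_cp1252_encode(sample[2]).decode('utf8')
the “Made in the USA” label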
Since this library automates the task, and does a better job than the standard Python codecs can here, you should just install it and apply it to the mess this API hands you. Don't hesitate to berate the people who hand you this data, however, if you have half a chance. They have produced one lovely muck-up.
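To apply that to the whole document, one straightforward approach (a sketch: fix_strings and raw_json are my names, and this assumes the JSON itself still parses) is to decode the JSON as usual and walk the result, running ftfy over every string:

import json
import ftfy

def fix_strings(obj):
    # Recursively run ftfy over every string in the decoded JSON structure.
    if isinstance(obj, unicode):
        return ftfy.fix_text(obj)
    if isinstance(obj, list):
        return [fix_strings(item) for item in obj]
    if isinstance(obj, dict):
        return dict((fix_strings(k), fix_strings(v)) for k, v in obj.items())
    return obj

data = fix_strings(json.loads(raw_json))  # raw_json: whatever the API sent you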