I am using Python to parse a JSON file. I know it is because of this ¥
that I got the following error
when calling json.loads:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 106:
invalid start byte
But how do I get around it? Do I decode and encode again?
¥ is the Chinese currency sign, but I am not sure which encoding it belongs to.
Thanks!
update:
====================
I think my question should be: if you see this symbol, how do you guess the encoding?
An answer to this question may be:
If you see ¥, then "utf-8" won't work, try "latin-1" instead.
Is this understanding correct?
The real answer is, in the general case, you cannot determine the encoding of an unknown piece of data.
Given context, such as English text, you can sometimes guess, e.g., that "c?rrupted" has had "o" replaced by "?", but if you don't have that sort of context, you can't even tell which bytes are wrong.
For your specific example, you are asking the question the wrong way around. If you see a yen sign, which encoding are you using to look at the data? If it's Latin-1, then you are looking at a byte value of 0xA5. This value can be looked up; you could be looking at any of v, ¥, ¸, Ë, Í, Ñ, Ą, ą, ċ, Ĩ, Ľ, ź, Β, Ξ, ξ, Ѕ, Ц, е, Ґ, Ҙ, ح, ٪, ۴, ฅ, „, •, ₯, ╔, ﺄ, or a fragment out of a multi-byte encoding.
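To make the lookup concrete, here is a minimal sketch that decodes the single byte 0xA5 under a handful of codecs shipped with Python. The helper name `decode_under` and the choice of codecs are only illustrative; the list is nowhere near exhaustive.

```python
# The byte from the error message in the question.
RAW = b"\xa5"

# A small, arbitrary selection of codecs from the standard library.
CANDIDATES = ["latin-1", "cp1252", "cp437", "cp1251", "koi8_r", "utf-8"]


def decode_under(raw, codecs):
    """Map each codec name to the decoded text, or None if decoding fails."""
    results = {}
    for name in codecs:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # The byte sequence is not valid in this encoding.
            results[name] = None
    return results


if __name__ == "__main__":
    for name, text in decode_under(RAW, CANDIDATES).items():
        print(f"{name:8} -> {text!r}")
```

The same byte comes out as a different character under most of these codecs, and UTF-8 rejects it outright, which is why the byte alone cannot tell you the encoding.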
If the program or organization which produced the unknown data is available, you can talk to people and/or experiment with the software; but if an authoritative answer can't be found, you end up just guessing, or giving up.
There is a reason modern formats require a known encoding, and will reject input which clearly violates that.
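As a small illustration of that strictness, the sketch below checks whether a byte string is valid UTF-8. It assumes Python 3.6 or later, where json.loads also accepts bytes; the helper name `is_valid_utf8` is made up for this example.

```python
import json


def is_valid_utf8(raw):
    """Return True if the bytes decode cleanly as UTF-8, else False."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# In UTF-8 the character U+00A5 (¥) is the two-byte sequence C2 A5,
# so a lone 0xA5 byte can never be valid UTF-8.
yen_utf8 = "\u00a5".encode("utf-8")   # b"\xc2\xa5"
yen_latin1 = b"\xa5"                  # the same character in Latin-1

if __name__ == "__main__":
    print(is_valid_utf8(yen_utf8))    # valid UTF-8
    print(is_valid_utf8(yen_latin1))  # rejected by a strict decoder
    # json.loads on bytes expects UTF-8, UTF-16 or UTF-32.
    print(json.loads(b'"' + yen_utf8 + b'"'))
```

So the error in the question is the decoder doing its job: the file simply is not UTF-8.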
The problem was solved by using the following code:
json.loads(contents, encoding='latin1')
I was confused about the encoding because the source did not specify it clearly.
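For anyone on Python 3: the `encoding` keyword of json.loads is no longer available there (it was removed in Python 3.9), so the equivalent is to decode the bytes yourself first. A minimal sketch, assuming the data really is Latin-1; the function name `load_json_bytes` and the sample data are made up for illustration:

```python
import json


def load_json_bytes(raw, encoding="latin-1"):
    """Decode raw bytes with an explicitly chosen encoding, then parse JSON."""
    # Latin-1 maps every byte to a character, so this decode never fails;
    # that also means it cannot tell you whether the guess was right.
    return json.loads(raw.decode(encoding))


if __name__ == "__main__":
    sample = b'{"price": "\xa5100"}'  # hypothetical file contents
    print(load_json_bytes(sample))
```

When reading from disk, open the file in binary mode (`open(path, "rb")`) and pass the result of `read()` to the function.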