Question:

I am using Python to parse a JSON file. I know it is because of this ¥ that I got the following error when calling json.loads:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 106:
invalid start byte

But how do I get around it? Do I decode and encode again?

¥ is the Chinese currency sign, but I am not sure which character encoding it belongs to.

Thanks!

update:

====================

I think my question should be: if you see this symbol, how do you guess the encoding?

An answer to this question might be:

If you see ¥, then "utf-8" won't work, so try "latin-1" instead. Is this understanding correct?

Answer 1:

The real answer is, in the general case, you cannot determine the encoding of an unknown piece of data.

Given context, such as English text, you can sometimes guess e.g. that c?rrupted has had "o" replaced by "?", but if you don't have that sort of context, you can't even tell which bytes are wrong.

For your specific example, you are asking the question the wrong way around. If you see a yen sign, which encoding are you using to look at the data? If it's Latin-1, then you are looking at a byte value of 0xA5. This value can be looked up; depending on the actual encoding, that byte could be any of v, ¥, ¸, Ë, Í, Ñ, Ą, ą, ċ, Ĩ, Ľ, ź, Β, Ξ, ξ, Ѕ, Ц, е, Ґ, Ҙ, ح, ٪, ۴, ฅ, „, •, ₯, ╔, ﺄ, or a fragment of a multi-byte sequence.
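
To make the lookup concrete, here is a minimal sketch that decodes the single byte 0xA5 under a handful of codecs that ship with Python. The codec list and the helper name decode_candidates are illustrative assumptions, not an exhaustive survey of encodings.

```python
# Decode the single byte 0xA5 under several codecs bundled with Python.
# The codec list is an illustrative assumption, not a complete survey.
RAW = b"\xa5"
CODECS = ["latin-1", "cp1252", "cp437", "cp037", "cp1251",
          "iso8859-2", "koi8-r", "utf-8"]


def decode_candidates(raw, codecs):
    """Map each codec name to the decoded text, or None if decoding fails."""
    results = {}
    for name in codecs:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # The byte sequence is not valid in this encoding.
            results[name] = None
    return results


candidates = decode_candidates(RAW, CODECS)
for name, text in candidates.items():
    print(f"{name:10} -> {text!r}")
```

Every single-byte codec in the list accepts the byte without complaint and simply yields a different character, which is why a successful decode says nothing about whether the guess was right. Only UTF-8 rejects it, because 0xA5 is a continuation byte and cannot start a sequence.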

If the program or organization which produced the unknown data is available, you can talk to people and/or experiment with the software; but if an authoritative answer can't be found, you end up just guessing, or giving up.

There is a reason modern formats require a known encoding, and will reject input which clearly violates that.
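
As a rough sketch of what that means for the JSON case in the question: once the encoding is known (here it is assumed to be Latin-1, which would have to be confirmed with whoever produced the file), decode the bytes explicitly and hand the resulting text to json.loads. The payload and the helper name load_json_bytes are hypothetical.

```python
import json

# Hypothetical payload: the byte 0xA5 stands in for the yen sign.
raw = b'{"price": "\xa5100"}'


def load_json_bytes(data, encoding):
    """Decode with an explicitly chosen encoding, then parse as JSON."""
    return json.loads(data.decode(encoding))


# UTF-8 rejects the payload, matching the error in the question.
try:
    load_json_bytes(raw, "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# With the (assumed) correct encoding, parsing succeeds.
parsed = load_json_bytes(raw, "latin-1")
print(utf8_ok, parsed)
```

Note that Latin-1 never raises a decode error for any byte, so success here only shows that the bytes were accepted, not that Latin-1 was the right choice.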