I'm following the Unicode - How to get the characters right? post.
The only issue I have is with JSONObject encoding (I'm using the org.json library).
The issue arises when I put a string like àòùè쀀 in a JSONObject, for example:
System.out.println(entry.getValue());
JSONObject temp = new JSONObject();
temp.put("values", entry.getValue());
System.out.println(temp.toString());
The first println gives àòùè쀀 and the second gives {"values":"àòùèì\u20ac\u20ac"} instead of {"values":"àòùè쀀"}.
EDIT
When converting a Hashtable to a JSONObject, some characters are written as Unicode escapes (\uXXXX). For example, the Hashtable
{€èòàùì€ù=èòàù€ì, €òàèùì€=èòàù€ìç§$}
becomes the JSONObject
{"\u20acòàèùì\u20ac":"èòàù\u20acìç§$","\u20acèòàùì\u20acù":"èòàù\u20acì"}
They are exactly equal, with the Unicode escaping taking a bit more space. Writing \u0061 in Java is exactly the same as writing a. If correctness is your concern, it doesn't matter. It won't take a considerable amount of extra space either, unless most of your text is between 0x2000 and 0x20FF.
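To put a number on "a bit more space", here is a minimal sketch that compares the UTF-8 size of the euro sign with the size of its six-character escape. The class and method names are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of org.json.
final class EscapeSizeSketch {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The compiler turns this escape into the single euro character.
        String literal = "\u20ac";
        // Doubling the backslash keeps the six ASCII characters as they
        // would appear in the escaped JSON text.
        String escaped = "\\u20ac";

        System.out.println(utf8Length(literal)); // 3
        System.out.println(utf8Length(escaped)); // 6

        // A Java source escape and the plain letter are the same string.
        System.out.println("\u0061".equals("a")); // true
    }
}
```

So each escaped euro sign costs 3 extra bytes in UTF-8 output, which only adds up if the text is dominated by characters from the escaped range.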
The quoting code in the library escapes C0 and C1 control characters, but it also escapes 0x2000 - 0x20FF.
So any character between 0x2000 and 0x20FF, as well as any control character, is represented as a Unicode escape. This makes sense for the C0 control characters (below 0x20), because those are not allowed in JSON strings in their unescaped form.
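The escaping rule described above can be sketched roughly as follows. This is an illustration written for this answer under the stated ranges, not the org.json source; the class and method names are hypothetical.

```java
// Hypothetical sketch of the escaping rule, not the library's code.
final class JsonEscapeSketch {

    // Wraps s in double quotes and escapes quotes, backslashes,
    // C0 controls (below 0x20), C1 controls (0x80 to 0x9F)
    // and the whole 0x2000 to 0x20FF block.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20
                    || (c >= 0x80 && c < 0xA0)
                    || (c >= 0x2000 && c < 0x2100)) {
                // Four lowercase hex digits after the backslash-u prefix.
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                // Everything else, including accented letters, stays as is.
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        // The euro sign (0x20AC) falls in the escaped block,
        // so it comes out as a six-character escape.
        System.out.println(quote("price: \u20ac5"));
    }
}
```

With this rule, à (0xE0) is written literally while € (0x20AC) is escaped, which matches the output shown in the question.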
As for 0x2000 - 0x20FF, I have no idea, because the code is not commented. Every character in that range is valid JSON when unescaped. Of course, 0x2028 and 0x2029 are not valid inside JavaScript string literals (at least before ES2019), so this small detail makes JSON syntax not a strict subset of JavaScript syntax. It is therefore a good idea to escape those two in JSON in case it is served as JSONP, which is really JavaScript. But it is not apparent to me why the code escapes the whole range when just 2 characters in it are problematic.
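If JSONP safety is the only goal, escaping just those two characters would be enough. A minimal sketch, assuming the input is already valid JSON text (where these characters can only occur inside strings); the class and method names are hypothetical:

```java
// Hypothetical helper, not part of org.json.
final class JsonpSafeSketch {

    private static final String LINE_SEPARATOR = String.valueOf((char) 0x2028);
    private static final String PARAGRAPH_SEPARATOR = String.valueOf((char) 0x2029);

    // Replaces raw 0x2028 and 0x2029 characters with their six-character
    // escapes and leaves every other character untouched.
    static String escapeLineSeparators(String json) {
        return json
                .replace(LINE_SEPARATOR, "\\u2028")
                .replace(PARAGRAPH_SEPARATOR, "\\u2029");
    }

    public static void main(String[] args) {
        String json = "{\"values\":\"a" + LINE_SEPARATOR + "b\"}";
        System.out.println(escapeLineSeparators(json));
    }
}
```

Applied to the question's data, this would leave € unescaped, since 0x20AC is neither of the two separators.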