There is a long-standing bug in RJSONIO when parsing json strings that contain unicode escape sequences. It seems the bug needs to be fixed in libjson, which might not happen any time soon, so I am looking into creating a workaround in R which unescapes \uxxxx sequences before feeding them to the json parser.
Some context: json data is always unicode, using utf-8 by default, so there is generally no need for escaping. But for historical reasons, json does support escaped unicode. Hence the json data
{"x" : "Zürich"}
and
{"x" : "Z\u00FCrich"}
are equivalent and should result in exactly the same output when parsed. But for whatever reason, the latter doesn't work in RJSONIO. Additional confusion is caused by the fact that R itself supports escaped unicode as well. So when we type "Z\u00FCrich" in an R console, it is automatically converted to "Zürich". To get the actual json string at hand, we need to escape the backslash that starts the unicode escape sequence in json:
test <- '{"x" : "Z\\u00FCrich"}'
cat(test)
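To make the difference concrete, here is a small sketch (the variable names `parsed` and `literal` are only for illustration) comparing the string R builds from an escaped literal with the one that keeps the backslash:

```r
# R's own parser turns \u00FC inside a literal into the single character ü
parsed <- "Z\u00FCrich"
# Doubling the backslash keeps the six-character escape sequence untouched
literal <- "Z\\u00FCrich"

nchar(parsed)   # 6
nchar(literal)  # 11
```

The second string is the one a json parser actually receives, and it is the one that needs unescaping.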
So my question is: given a large json string in R, how can I unescape all escaped unicode sequences? I.e. how do I replace all occurrences of \uxxxx by the corresponding unicode character? Again, the \uxxxx here represents an actual string of 6 characters, starting with a backslash. So an unescape function should satisfy:
#Escaped string
escaped <- "Z\\u00FCrich"
#Unescape unicode
unescape(escaped) == "Zürich"
#This is the same thing
unescape(escaped) == "Z\u00FCrich"
One thing that might complicate things is that if the backslash itself is escaped in json with another backslash, it is not part of the unicode escape sequence. E.g. unescape should also satisfy:
#Watch out for escaped backslashes
unescape("Z\\\\u00FCrich") == "Z\\\\u00FCrich"
unescape("Z\\\\\\u00FCrich") == "Z\\\\ürich"
Maybe like this?
This is not looking at the letters. It is just waiting for a quote.
After playing with this some more, I think the best I can do is search for \uxxxx patterns using a regular expression, and then parse those using the R parser.
This seems to work for all cases and I haven't found any odd side effects yet.
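A minimal sketch of this regex-based idea follows. It is not the original code: instead of the R parser it uses `strtoi()` and `intToUtf8()` to convert the hex digits, and it assumes an escape only counts when it is preceded by an even number of backslashes. It also assumes the input contains no surrogate pairs (code points outside the Basic Multilingual Plane), which it does not handle.

```r
unescape <- function(x) {
  # Match "\u" + 4 hex digits, preceded by zero or more *pairs* of
  # backslashes; the lookbehind ensures the run of backslashes starts here.
  pattern <- "(?<!\\\\)(?:\\\\\\\\)*\\\\u[0-9A-Fa-f]{4}"
  m <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    vapply(hits, function(hit) {
      n <- nchar(hit)
      # Everything before the final 6 characters is escaped backslashes
      prefix <- substr(hit, 1L, n - 6L)
      # The last 4 characters are the hex code point
      code <- strtoi(substr(hit, n - 3L, n), base = 16L)
      paste0(prefix, intToUtf8(code))
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

unescape("Z\\u00FCrich") == "Z\u00FCrich"
unescape("Z\\\\u00FCrich") == "Z\\\\u00FCrich"
unescape("Z\\\\\\u00FCrich") == "Z\\\\\u00FCrich"
```

Escaped backslashes are deliberately left as they are, so the result is still valid json and can be passed on to the parser.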
There is a function for this in the stringi package :)
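The answer does not name the function; my assumption is that it refers to `stri_unescape_unicode()`. A short sketch, guarded so it only runs when stringi is installed:

```r
if (requireNamespace("stringi", quietly = TRUE)) {
  escaped <- "Z\\u00FCrich"
  # Converts \uxxxx escape sequences into the characters they stand for
  unescaped <- stringi::stri_unescape_unicode(escaped)
  print(unescaped == "Z\u00FCrich")
}
```

Note that, as far as I know, this function also unescapes other sequences such as \\ and \n, so it will not leave escaped backslashes untouched the way the question asks. That matters if the result is passed on to a json parser.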