R JSON UTF-8 parsing

Posted 2020-03-31 04:42

Question:

I have an issue parsing a JSON file that contains Russian (Cyrillic) text in R. The file looks like this:

[{"text": "Валера!", "type": "status"}, {"text": "когда выйдет", "type": "status"}, {"text": "КАК ДЕЛА?!)", "type": "status"}]

and it is saved in UTF-8 encoding. I tried the rjson, RJSONIO and jsonlite libraries to parse it, but none of them works:

library(jsonlite)
allFiles <- fromJSON(txt="ru_json_example_short.txt")

gives me this error:

Error in feed_push_parser(buf) : 
  lexical error: invalid char in json text.
                                       [{"text": "Валера!", "
                     (right here) ------^

When I save the file in ANSI encoding, parsing works, but the Russian characters turn into question marks, so the output is unusable. Does anyone know how to parse such a JSON file in R?

Edit: The above applies to a UTF-8 file saved in Windows Notepad. When I save it in PSPad and then parse it, the result looks like this:

    text   type
1                                         <U+0412><U+0430><U+043B><U+0435><U+0440><U+0430>! status
2 <U+043A><U+043E><U+0433><U+0434><U+0430> <U+0432><U+044B><U+0439><U+0434><U+0435><U+0442> status
3                              <U+041A><U+0410><U+041A> <U+0414><U+0415><U+041B><U+0410>?!) status

Answer 1:

Try the following:

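library(jsonlite)
# Read the file line by line, join the lines with commas, and wrap them in [ ] before parsing
dat <- fromJSON(sprintf("[%s]",
                paste(readLines("./ru_json_example_short.txt"),
                collapse=",")))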
dat
[[1]]
       text   type
1      Валера! status
2 когда выйдет status
3  КАК ДЕЛА?!) status

ref: Error parsing JSON file with the jsonlite package
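
One likely cause of the "invalid char" error at the very start of the text is a UTF-8 byte order mark (BOM), which Windows Notepad typically writes when saving as UTF-8. The following is a minimal sketch under that assumption, not a confirmed diagnosis of the file in the question. It builds a small temporary file with a BOM (the path and sample content are made up for illustration), checks the first three bytes, and reads the file through a connection opened with encoding "UTF-8-BOM", which drops a leading BOM if one is present. It also assumes a UTF-8 locale; in a non-UTF-8 native locale the connection re-encodes the text to the native encoding and the Cyrillic characters may not survive.

```r
library(jsonlite)

# Hypothetical test file: UTF-8 JSON preceded by a byte order mark (EF BB BF).
path <- tempfile(fileext = ".txt")
json_text <- paste0(
  '[{"text": "\u0412\u0430\u043b\u0435\u0440\u0430!", "type": "status"}, ',
  '{"text": "\u043a\u043e\u0433\u0434\u0430 \u0432\u044b\u0439\u0434\u0435\u0442", "type": "status"}]'
)
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw(enc2utf8(json_text))), path)

# Peek at the first three bytes to see whether the file starts with a BOM.
has_bom <- identical(readBin(path, what = "raw", n = 3L), bom)

# Open the file as UTF-8 and let the connection skip a leading BOM, if any.
con <- file(path, encoding = "UTF-8-BOM")
lines <- readLines(con, warn = FALSE)
close(con)

# Parse the cleaned text; the result is a data frame with columns text and type.
dat <- fromJSON(paste(lines, collapse = "\n"))
```

If has_bom is FALSE for the real file, the BOM is not the culprit and this sketch will not change the outcome.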