R JSON UTF-8 parsing

Posted 2020-03-31 04:52

I have an issue when trying to parse a JSON file written in the Russian alphabet in R. The file looks like this:

[{"text": "Валера!", "type": "status"}, {"text": "когда выйдет", "type": "status"}, {"text": "КАК ДЕЛА?!)", "type": "status"}]

The file is saved in UTF-8 encoding. I tried the rjson, RJSONIO and jsonlite libraries to parse it, but none of them works:

library(jsonlite)
allFiles <- fromJSON(txt="ru_json_example_short.txt")

gives me this error:

Error in feed_push_parser(buf) : 
  lexical error: invalid char in json text.
                                       [{"text": "Валера!", "
                     (right here) ------^

When I save the file in ANSI encoding, parsing works, but the Russian letters turn into question marks, so the output is unusable. Does anyone know how to parse such a JSON file in R, please?

Edit: The above applies to a UTF-8 file saved in Windows Notepad. When I save it in PSPad and then parse it, the result looks like this:

    text   type
1                                         <U+0412><U+0430><U+043B><U+0435><U+0440><U+0430>! status
2 <U+043A><U+043E><U+0433><U+0434><U+0430> <U+0432><U+044B><U+0439><U+0434><U+0435><U+0442> status
3                              <U+041A><U+0410><U+041A> <U+0414><U+0415><U+041B><U+0410>?!) status

1 answer
太酷不给撩
Reply #2 · 2020-03-31 05:13

Try the following:

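library(jsonlite)
# Read the file line by line, join the lines with commas,
# and wrap the result in [ ] so it parses as one JSON array.
dat <- fromJSON(sprintf("[%s]",
                paste(readLines("./ru_json_example_short.txt"),
                collapse=",")))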
dat
[[1]]
       text   type
1      Валера! status
2 когда выйдет status
3  КАК ДЕЛА?!) status

ref: Error parsing JSON file with the jsonlite package
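
A note on one likely cause, offered as a hedged sketch rather than a confirmed diagnosis: Windows Notepad typically writes UTF-8 files with a leading byte-order mark (bytes EF BB BF). That would be consistent with the lexical error pointing at the very first character, and with the PSPad-saved file parsing without an error. Assuming that is what happens here, the base R snippet below reads the file as raw bytes, drops the BOM if one is present, and marks the text as UTF-8 before it is handed to a parser. `read_utf8_no_bom` is a hypothetical helper name, not part of jsonlite.

```r
# Hypothetical helper: return the file contents as one UTF-8 string,
# with a leading UTF-8 byte-order mark (EF BB BF) removed if present.
read_utf8_no_bom <- function(path) {
  # Read the whole file as raw bytes so no locale conversion happens.
  bytes <- readBin(path, what = "raw", n = file.size(path))
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  if (length(bytes) >= 3 && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  txt <- rawToChar(bytes)
  # Declare the encoding so R treats the bytes as UTF-8 text.
  Encoding(txt) <- "UTF-8"
  txt
}
```

The returned string could then be passed to the parser, for example fromJSON(read_utf8_no_bom("ru_json_example_short.txt")), assuming jsonlite is loaded and the file name from the question.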
