Displaying UTF-8 encoded Chinese characters in R

Posted 2019-02-05 02:19

I am trying to open a UTF-8 encoded .csv file that contains (traditional) Chinese characters in R. For some reason, R sometimes displays the content as Chinese characters and sometimes as Unicode escape codes.

For instance:

data <- read.csv("mydata.csv", encoding="UTF-8")

data

will produce Unicode escape codes, while:

data <- read.csv("mydata.csv", encoding="UTF-8")

data[,1]

will actually display Chinese characters.

If I turn it into a matrix, it also displays Chinese characters, but if I look at the data with View(data) or fix(data), it is shown as Unicode escape codes again.

I asked people who use a Mac for advice (I'm on a PC running Windows 7); some of them got Chinese characters throughout, others didn't. I tried saving the original data as a table instead and reading it into R that way, with the same result. I tried running the script in RStudio, Revolution R, and RGui. I tried to adjust the locale (e.g. to Chinese), but either R didn't let me change it or the result was gibberish instead of Unicode escape codes.

My current locale is:

"LC_COLLATE=French_Switzerland.1252;LC_CTYPE=French_Switzerland.1252;LC_MONETARY=French_Switzerland.1252;LC_NUMERIC=C;LC_TIME=French_Switzerland.1252"

Any help to get R to consistently display Chinese characters would be greatly appreciated...

Tags: r utf-8 locale
2 answers
手持菜刀,她持情操
#2 · 2019-02-05 02:33

In my case, UTF-8 encoding does not work in R on my machine, but the GB* encodings do. UTF-8 does work on Ubuntu, though. First you need to figure out the default encoding of your OS, and encode the file accordingly. Excel cannot encode the file as UTF-8 properly, even when it claims to save it as UTF-8.

(1) Download 'open sheet'.

(2) Open the file with it. You can cycle through the encoding options until you see the Chinese characters displayed correctly in the preview window.

(3) Save it as UTF-8 (if you want UTF-8). UTF-8 is not the solution to every problem; you HAVE TO know the default encoding of your system first.
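For the "figure out the default encoding" step, a minimal sketch in base R may help. It only inspects the current session; the commented-out read.csv call is an illustration, and "GB18030" is an assumed example of a GB* encoding name, not something stated above.

```r
# What does this R session treat as its native encoding?
ctype <- Sys.getlocale("LC_CTYPE")  # locale string for character handling
info  <- l10n_info()                # list with MBCS, UTF-8 and Latin-1 flags

cat("LC_CTYPE:      ", ctype, "\n")
cat("UTF-8 session: ", info[["UTF-8"]], "\n")

# If the file is not actually UTF-8, its real encoding can be named when
# reading it (assumed example, adjust to whatever the file really uses):
# data <- read.csv("mydata.csv", fileEncoding = "GB18030", stringsAsFactors = FALSE)
```

If the UTF-8 flag is FALSE, as it would be with a locale ending in .1252, the session has to convert UTF-8 text to the native code page for display, which is where Chinese characters can get lost.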

This account has been banned
#3 · 2019-02-05 02:39

Not a bug, but rather a misunderstanding of the type conversions (character versus factor) that happen when a data.frame is constructed.

You could start with data <- read.csv("mydata.csv", encoding="UTF-8", stringsAsFactors=FALSE), which makes the columns with Chinese text character vectors rather than factors, so printing them should show what you expect.

@nograpes: similarly, x <- c('中華民族'); x; y <- data.frame(x, stringsAsFactors=FALSE) and everything should be OK.
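The difference between the two column types can be checked with a small round trip. This is only a sketch under assumptions: the temporary file stands in for mydata.csv, the column name "name" is made up, and the sample text is written with \u escapes so the script itself stays ASCII. Whether the characters then print correctly still depends on the session locale.

```r
# Sample text (the four characters from the example above) as \u escapes.
x <- "\u4e2d\u83ef\u6c11\u65cf"

# Write a small UTF-8 CSV to a temporary file (stand-in for mydata.csv).
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeLines(c("name", x), con, useBytes = TRUE)  # write the UTF-8 bytes unchanged
close(con)

# Read it back both ways and compare the column types.
as_factor <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = TRUE)
as_char   <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)

class(as_factor$name)   # "factor"
class(as_char$name)     # "character"
Encoding(as_char$name)  # encoding mark carried by the strings

unlink(tmp)
```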
