I am trying to open a UTF-8 encoded .csv file containing (traditional) Chinese characters in R. For some reason, R sometimes displays the data as Chinese characters and sometimes as Unicode escape codes.
For instance:
data <- read.csv("mydata.csv", encoding="UTF-8")
data
prints Unicode escape codes, while:
data <- read.csv("mydata.csv", encoding="UTF-8")
data[,1]
will actually display Chinese characters.
If I turn it into a matrix, it also displays Chinese characters, but if I inspect the data with View(data) or fix(data), I get Unicode escapes again.
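A minimal reproduction of what I see:

m <- as.matrix(data)
m           # Chinese characters display correctly
View(data)  # Unicode escapes again
fix(data)   # same here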
I've asked for advice from people who use a Mac (I'm on a PC running Windows 7); some of them got Chinese characters throughout, others didn't. I tried saving the original data as a table instead and reading it into R that way, with the same result. I tried running the script in RStudio, Revolution R, and RGui. I also tried to adjust the locale (e.g. to Chinese), but either R wouldn't let me change it, or the result was gibberish instead of Unicode escapes.
My current locale is:
"LC_COLLATE=French_Switzerland.1252;LC_CTYPE=French_Switzerland.1252;LC_MONETARY=French_Switzerland.1252;LC_NUMERIC=C;LC_TIME=French_Switzerland.1252"
Any help to get R to consistently display Chinese characters would be greatly appreciated...
In my case, UTF-8 encoding does not work in my R on Windows, but the GB* encodings do. UTF-8 does work on Ubuntu, though. First you need to figure out the default encoding of your OS, and then read the file declaring the encoding it was actually saved in. Also note that Excel cannot encode a file as proper UTF-8, even when it claims to save it as UTF-8.
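A sketch of that check (Sys.getlocale() and localeToCharset() are standard R; "GB18030" is only an example of a GB* encoding, substitute whatever your file was actually saved in):

Sys.getlocale("LC_CTYPE")   # the locale R is running under
localeToCharset()           # the encoding R assumes for that locale
data <- read.csv("mydata.csv", fileEncoding = "GB18030")  # declare the file's real encoding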
Not a bug, more a misunderstanding of the underlying type-system conversions (the character type and the factor type) when constructing a data.frame.

You could start with

data <- read.csv("mydata.csv", encoding="UTF-8", stringsAsFactors=FALSE)

which will make your Chinese characters be of the character type, so by printing them out you should see what you are expecting.

@nograpes: similarly,

x <- c('中華民族'); x; y <- data.frame(x, stringsAsFactors=FALSE)

and everything should be ok.
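To see the difference stringsAsFactors makes, here is a minimal sketch (note that before R 4.0.0 the default was stringsAsFactors=TRUE, which is what bites here):

x <- c('中華民族')
f <- data.frame(x)                            # old default: x is converted to a factor
y <- data.frame(x, stringsAsFactors = FALSE)  # x stays a plain character vector
str(f$x)  # Factor w/ 1 level
str(y$x)  # chr "中華民族"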