How to read an Excel file containing Chinese characters in R?

Posted 2019-08-23 06:12

I always convert Excel files to CSV and import them into R as follows.

myDataFrame <- read.csv("mydatafile.csv", stringsAsFactors = FALSE)

But I run into a serious problem when I convert an xlsx file written in Chinese: most of the characters (not all of them) show up as '??' because of the encoding.

So I decided to use the xlsx package to import the file directly. The problem is that the Excel file is larger than 10 MB, and I got an error message because of the JVM's memory limit. (I assume that xlsx uses Java internally.)

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: GC overhead limit exceeded

How can I import a Chinese Excel file into R? I tried 'Save as...' CSV, opened the result in Notepad, and saved it with the 'UTF-8' option, but the result was the same (it still shows '??').

FYI, I can see all the Chinese characters in the original Excel file.

Tags: r excel csv
1 answer
Deceive 欺骗
#2 · 2019-08-23 06:28

Your question mixes several issues. Let's assume that you have converted the xlsx file into a CSV. If you haven't, please refer to other threads like this one. I think this step is best carried out in an external tool rather than in R.

Now that we have a CSV, two problems remain: size and encoding. For encoding, as you mentioned in the comment, you can use the encoding options of R functions such as read.csv (in read.csv, fileEncoding= declares the encoding of the file itself, while encoding= only marks the strings that were read as Latin-1 or UTF-8). For Chinese files coming out of Excel, the encoding is most probably "GB18030". If you cannot tell which encoding a file uses, the open-file dialog of LibreOffice Calc may give you a clue.
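As a minimal sketch of the encoding step, assuming the CSV really was saved as GB18030: the argument that tells read.csv how the bytes on disk are encoded is fileEncoding. The helper name read_chinese_csv and its default encoding are illustrative assumptions, not something taken from the question.

```r
# Hypothetical helper: read a CSV whose bytes are in a known encoding.
# The default "GB18030" is an assumption about how Excel saved the file.
read_chinese_csv <- function(path, enc = "GB18030") {
  # fileEncoding re-encodes the file contents while reading
  read.csv(path, fileEncoding = enc, stringsAsFactors = FALSE)
}

# Usage, with the file name from the question:
# myDataFrame <- read_chinese_csv("mydatafile.csv")
```

If '??' still appears, the guess about the encoding is probably wrong. For a file re-saved from Notepad as UTF-8, enc = "UTF-8-BOM" may be worth trying, since Notepad can write a byte order mark at the start of the file.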

If the file is large, you may first convert its encoding with the Linux command iconv and then process it further in R.
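Where the command-line tool is not available, a comparable conversion can be sketched inside R with the base function iconv(). This is a stand-in for the Linux command, not the same tool. The function name convert_to_utf8, the chunk size, and the GB18030 source encoding are all assumptions.

```r
# Hypothetical helper: stream a file from one encoding to UTF-8 in chunks,
# so the whole file never has to sit in memory at once.
convert_to_utf8 <- function(src, dest, from = "GB18030", chunk_lines = 10000L) {
  input <- file(src, open = "r")
  on.exit(close(input), add = TRUE)
  output <- file(dest, open = "w")
  on.exit(close(output), add = TRUE)

  repeat {
    # Read the next block of lines as raw, not yet re-encoded text
    lines <- readLines(input, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0L) break
    converted <- iconv(lines, from = from, to = "UTF-8")
    # iconv() returns NA for lines it cannot convert
    if (anyNA(converted)) stop("some lines are not valid ", from)
    # useBytes = TRUE writes the UTF-8 bytes unchanged
    writeLines(converted, output, useBytes = TRUE)
  }
  invisible(dest)
}

# Usage (file names are placeholders):
# convert_to_utf8("mydatafile.csv", "mydatafile_utf8.csv")
# myDataFrame <- read.csv("mydatafile_utf8.csv", fileEncoding = "UTF-8")
```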

Now for the size part. A 50 MB or even 500 MB CSV can easily be handled by read.csv, although not necessarily quickly, provided that you have enough memory. If the file is larger than 1 GB, there are two options:

  1. Use the sqldf package, which reads the csv into a temporary database, and then into a data.frame.
  2. Process the csv line by line. First use file() to create a connection, then use readLines() to read and process it a line (or a block of lines) at a time. Finally, manually combine the results into a data.frame or other appropriate structure.

The first is simpler; the second can handle really large files.
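The second option could look roughly like the base R sketch below. It is one possible shape under stated assumptions: the file is already UTF-8 (for example after an iconv conversion), it has a header row, and no quoted field contains a line break. The names read_csv_in_chunks and process_chunk are hypothetical. The idea is that process_chunk filters or summarizes each block, so only the reduced pieces are kept in memory.

```r
# Hypothetical helper: read a large UTF-8 CSV in blocks of lines and combine
# the per-block results into one data.frame.
# Assumes a header row and no line breaks inside quoted fields.
read_csv_in_chunks <- function(path, process_chunk = identity,
                               chunk_lines = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # The header is read once and prepended to every block
  header <- readLines(con, n = 1L, warn = FALSE, encoding = "UTF-8")
  pieces <- list()
  repeat {
    # encoding = "UTF-8" marks the strings as UTF-8 without re-encoding them
    lines <- readLines(con, n = chunk_lines, warn = FALSE, encoding = "UTF-8")
    if (length(lines) == 0L) break
    chunk <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
    pieces[[length(pieces) + 1L]] <- process_chunk(chunk)
  }
  # Returns NULL if the file has no data rows
  do.call(rbind, pieces)
}

# Usage (file name and column name are placeholders):
# keep_some <- function(d) d[d$id > 100, ]
# myDataFrame <- read_csv_in_chunks("mydatafile_utf8.csv", keep_some)
```

Because each block is parsed separately, column types are guessed per block. If a column looks numeric in one block and textual in another, rbind() will coerce it, so passing colClasses to read.csv may be safer for real data.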

Hope it helps.
