utf-8 characters get lost when converting from lis

2019-04-26 12:45发布

问题:

I am using R 3.2.0 with RStudio 0.98.1103 on Windows 7 64-bit. The Windows "regional and language settings" of my computer is English (United States).

For some reason the following code replaces my Czech characters "č" and "ř" by "c" and "r" in the text "Koryčany nad přehradou", when I read a XML file in utf-8 encoding from the web, parse the XML file to a list, and convert the list to a data.frame.

library(XML)
url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
doc <- xmlRoot(xmlTreeParse(url, getDTD=FALSE, useInternalNodes = TRUE))
infoList <- xmlToList(doc[[2]][[1]])
siteName <- infoList$siteName

#this still displays correctly "Koryčany nad přehradou"
print(siteName) 

#make a data.frame from the list item. I suspect here is the problem.
df <- data.frame(name=siteName, id=1)

#now the Czech characters are lost. I see only "Korycany nad prehradou"
View(df) 

write.csv(df,"test.csv")
#the test.csv file also contains "Korycany nad prehradou" 
#instead of "Koryčany nad přehradou"

What is the problem? How do I make R to show my data.frame correctly with all the utf-8 special characters and save the .csv file without losing the "č" and "ř" Czech characters?

回答1:

This is not a perfect answer, but the following workaround solved the problem for me. I tried to understand the behavior or R, and make the example so that my R script produces the same results both on Windows and on Linux platform:

(1) Get XML data in UTF-8 from the Internet

library(XML)
url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
doc <- xmlRoot(xmlTreeParse(url, getDTD=FALSE, useInternalNodes = TRUE))
infoList <- xmlToList(doc[[2]][[1]])
siteName <- infoList$siteName

(2) Print out the text from the Internet: Encoding is UTF-8, display in the R console is also correct using both the Czech and the English locale on Windows:

> Sys.getlocale(category="LC_CTYPE")
[1] "English_United States.1252"
> print(siteName)
[1] "Koryčany nad přehradou"
> Encoding(siteName)
[1] "UTF-8"
> 

(3) Try to create and view a data.frame. This has a problem. The data.frame displays incorrectly both in the RStudio view and in the console:

df <- data.frame(name=siteName, id=1)
df
                    name id
1 Korycany nad prehradou  1

(4) Try to use a matrix instead. Surprisingly the matrix displays correctly in the R console.

m <- as.matrix(df)
View(m)  #this shows incorrectly in RStudio
m        #however, this shows correctly in the R console.
     name                     id 
[1,] "Koryčany nad přehradou" "1"

(5) Change the locale. If I'm on Windows, set locale to Czech. If I'm on Unix or Mac, set locale to UTF-8. NOTE: This has some problems when I run the script in RStudio, apparently RStudio doesn't always react immediately to the Sys.setlocale command.

#remember the original locale.
original.locale <- Sys.getlocale(category="LC_CTYPE")

#for Windows set locale to Czech. Otherwise set locale to UTF-8
new.locale <- ifelse(.Platform$OS.type=="windows", "Czech_Czech Republic.1250", "en_US.UTF-8")
Sys.setlocale("LC_CTYPE", new.locale) 

(7) Write the data to a text file. IMPORTANT: don't use write.csv but instead use write.table. When my locale is Czech on my English Windows, I must use the fileEncoding="UTF-8" in the write.table. Now the text file shows up correctly in notepad++ and in also in Excel.

write.table(m, "test-czech-utf8.txt", sep="\t", fileEncoding="UTF-8")

(8) Set the locale back to original

Sys.setlocale("LC_CTYPE", original.locale)

(9) Try to read the text file back into R. NOTE: If I read the file, I had to set the encoding parameter (NOT fileEncoding !). The display of a data.frame read from the file is still incorrect, but when I convert my data.frame to a matrix the Czech UTF-8 characters are preserved:

data.from.file <- read.table("test-czech-utf8.txt", sep="\t", encoding="UTF-8")
#the data.frame still has the display problem, "č" and "ř" get "lost"
> data.from.file
                     name id
1 Korycany nad prehradou  1

#see if a matrix displays correctly: YES it does!
matrix.from.file <- as.matrix(data.from.file)
> matrix.from.file
  name                     id 
1 "Koryčany nad přehradou" "1"

So the lesson learnt is that I need to convert my data.frame to a matrix, set my locale to Czech (on Windows) or to UTF-8 (on Mac and Linux) before I write my data with Czech characters to a file. Then when I write the file, I must make sure fileEncoding must be set to UTF-8. On the other hand when I later read the file, I can keep working in the English locale, but in read.table I must set the encoding="UTF-8".

If anybody has a better solution, I'll welcome your suggestions.