SparkR shows Chinese characters incorrectly

Posted 2020-04-26 02:37

I am new to SparkR. I recently ran into a problem: after converting a data frame containing Chinese characters into SparkR, the characters are no longer displayed properly. For example:

city=c("北京","上海","杭州")
A <- as.data.frame(city)
A
  city
1 北京
2 上海
3 杭州

Then I created a SparkR DataFrame based on it and collected it back, and everything changed:

collect(createDataFrame(sqlContext,A))
      city
1 \027\xac
2      \nw
3    m\xde

I don't know how to convert these back into readable Chinese characters. Ideally I would like to see readable characters directly in SparkR, which would make debugging much easier.

I am using a Linux server; I'm not sure whether that is related. Does anybody know anything about this?

Below is the output of sessionInfo():

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.2 (Maipo)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] SparkR_1.5.2

loaded via a namespace (and not attached):
[1] tools_3.2.2

1 answer

Deceive 欺骗
Answered 2020-04-26 03:11

This is a known issue (it affects multibyte Unicode characters in general, not just Chinese) and was fixed in Spark 1.6; see SPARK-8951. You can either patch and rebuild 1.5 or upgrade to 1.6.
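If upgrading is not an option right away, one possible stopgap (a sketch only, not part of the official fix; it assumes the `base64enc` package is installed and that `sqlContext` already exists as in the question) is to keep multibyte strings out of the buggy R-to-JVM serialization path by base64-encoding them first, since base64 text is plain ASCII:

```r
library(SparkR)
library(base64enc)  # assumed available; install.packages("base64enc") if not

city <- c("北京", "上海", "杭州")

# Encode each string as ASCII-safe base64 before it crosses into the JVM
A <- data.frame(
  city_b64 = sapply(city, function(x) base64encode(charToRaw(x))),
  stringsAsFactors = FALSE
)

df <- createDataFrame(sqlContext, A)

# Decode back to UTF-8 after collect()
out <- collect(df)
out$city <- sapply(out$city_b64, function(x) rawToChar(base64decode(x)))
```

This only helps if the columns are opaque to your Spark-side logic (you cannot filter or group on the readable values while they are encoded), so upgrading to 1.6 remains the real fix.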
