
R RMySQL query deforms Japanese characters

Published 2019-05-28 14:53

Question:

I am using RMySQL to connect to an aws MySQL server. It works, except character values are deformed. This question has been asked before but the fixes don't seem to work for me. Here's what I'm doing:
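For reference, credentials below is the RMySQL connection handle. It is created roughly like this (the host, user, password and database name here are placeholders, not my real values):

library(RMySQL)

# Placeholder connection details -- not the real host/user/password
credentials <- dbConnect(
  MySQL(),
  host     = "my-instance.xxxxxxxx.ap-northeast-1.rds.amazonaws.com",
  user     = "me",
  password = "********",
  dbname   = "mydb"
)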

Make sure no connections are open:

dbListConnections(MySQL())
list()
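(If anything had still been open, it could be closed first; just a sketch of what that would look like:)

# Close any leftover connections before reconnecting (not needed here, the list is empty)
lapply(dbListConnections(MySQL()), dbDisconnect)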

Make sure my connection is set to use UTF-8:

dbGetQuery(credentials, "show variables like 'character_set%'")

             Variable_name                                     Value
1     character_set_client                                      utf8
2 character_set_connection                                      utf8
3   character_set_database                                      utf8
4 character_set_filesystem                                      utf8
5    character_set_results                                      utf8
6     character_set_server                                      utf8
7     character_set_system                                      utf8
8       character_sets_dir /rdsdbbin/mysql-5.5.40.R1/share/charsets/

Fetch data:

data <- dbGetQuery(credentials, Query)
head(data)
              keyword_ja
1                   \036
2               \036蜀ャ
3             \036螟\x8f
4       \036譌・譛ャ莠コ
5 \037繧、繝ゥ繧ケ繝\x88
6             \037蜿守ゥォ

When I write this data to disk, Excel shows the same deformed characters, but Notepad++ can somehow show the Japanese as intended:

"keyword_ja"

"冬" "夏" "日本人" "イラスト" "収穫"

I've been trying functions like Encoding() and enc2utf8() in R to get it to display the characters correctly the way Notepad++ does, with no success.

Encoding(head(data$keyword_ja))

[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"

enc2utf8(head(data$keyword_ja))

[1] "\036" "\036蜀ャ" "\036螟<8f>" "\036譌・譛ャ莠コ" "\037繧、繝ゥ繧ケ繝<88>" "\037蜿守ゥォ"

I can normally type Japanese characters and R has no problem displaying them:

Sys.getlocale()
[1] "LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932"
mystring <- "日本語入力できる"
mystring
[1] "日本語入力できる"
Encoding(mystring)
[1] "unknown"

I'm pretty desperate to figure this out so any help is very much appreciated. Please let me know if I can provide additional information.

Answer 1:

Based on this SO article, you might have to write your data to disk with UTF-8 encoding. Try this:

data <- dbGetQuery(credentials, Query)
con <- file('output.csv', encoding="utf8")
write.csv(data, file=con)

Then try opening output.csv in both Excel and Notepad++ and let us know the results. When you read this file back into R, it should hopefully behave as expected:

fread("test.csv", encoding="UTF-8")


Answer 2:

The only (awful) workaround I've found so far is to write the data to disk with write.csv, and then read it back in with fread from the data.table package. This gives results like this:

fread("test.csv", encoding="UTF-8") keyword_ja \036 \036冬 \036夏 \036日本人 \037イラスト

This is close to what I want: it has the correct Japanese words, but there are still some hidden characters like \036 in front of them.
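Putting the whole round trip together, the sketch below is roughly what I'm doing. The gsub() call at the end is only my guess at stripping the stray control bytes (\036 is 0x1E and \037 is 0x1F), it is not part of the workaround above:

library(data.table)

# Write the query result out as UTF-8, then read it back with fread
# (credentials and Query are the connection and SQL string from the question)
data <- dbGetQuery(credentials, Query)
con <- file("test.csv", encoding = "UTF-8")
write.csv(data, file = con, row.names = FALSE)

result <- fread("test.csv", encoding = "UTF-8")

# Guess: drop ASCII control characters such as \036 and \037 from the column
result[, keyword_ja := gsub("[[:cntrl:]]", "", keyword_ja)]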