I am using RMySQL to connect to an aws MySQL server. It works, except character values are deformed. This question has been asked before but the fixes don't seem to work for me. Here's what I'm doing:
Make sure no connections are open:
dbListConnections(MySQL()) list()
Make sure my connection is set to use UTF-8:
dbGetQuery(credentials, "show variables like 'character_set%'")
Variable_name Value
1 character_set_client utf8
2 character_set_connection utf8
3 character_set_database utf8
4 character_set_filesystem utf8
5 character_set_results utf8
6 character_set_server utf8
7 character_set_system utf8
8 character_sets_dir /rdsdbbin/mysql-5.5.40.R1/share/charsets/
Fetch data:
data <- dbGetQuery(credentials, Query) head(data) keyword_ja 1 \036 2 \036蜀ャ 3 \036螟\x8f 4 \036譌・譛ャ莠コ 5 \037繧、繝ゥ繧ケ繝\x88 6 \037蜿守ゥォ
When I write this data to disk Excel shows the same deformed characters, but notepad++ can somehow show the japanese as it's intended:
"keyword_ja"
"冬" "夏" "日本人" "イラスト" "収穫"
I've been trying to use functions like Encoding() and enc2utf8() in R to get it to display the characters correctly as notepad++ does, with no success.
Encoding(head(data$keyword_ja))
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
enc2utf8(head(data$keyword_ja))
[1] "\036" "\036蜀ャ" "\036螟<8f>" "\036譌・譛ャ莠コ" "\037繧、繝ゥ繧ケ繝<88>" "\037蜿守ゥォ"
I can normally type japanese characters and R has no problem displaying them
Sys.getlocale() [1] "LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932" mystring <- "日本語入力できる" mystring [1] "日本語入力できる" Encoding(mystring) [1] "unknown"
I'm pretty desperate to figure this out so any help is very much appreciated. Please let me know if I can provide additional information.
Based on this SO article, you might have to write your data to disk with UTF-8 encoding. Try this:
Then try opening
output.csv
in both Excel and Notepad++ and let us know the results. When you read this file back into R, it should hopefully behave as expected:The only (awful) workaround I've found so far is to write the data to disk with write.csv, and then read it back in with fread from the data.table package. This gives results like this:
fread("test.csv", encoding="UTF-8") keyword_ja \036 \036冬 \036夏 \036日本人 \037イラスト
which is close to what I want, it has the correct japanese words but there are some hidden characters like \036 before them