Why is stringr changing encoding when manipulating

Posted 2019-04-26 19:35

Question:

stringr has a strange behavior that is really annoying me: without any warning, it changes the encoding of strings that contain non-ASCII characters, in my case ø, å, æ, é and some others. If you str_trim a character vector, the elements containing such letters are converted to a new encoding.

library(stringr)

letter1 <- readline('Gimme an ASCII character!')    # try q or a
letter2 <- readline('Gimme a non-ASCII character!') # try ø or é
Letters <- c(letter1, letter2)
Encoding(Letters)           # 'unknown'
Encoding(str_trim(Letters)) # mixed 'unknown' and 'UTF-8'

This is a problem because I use data.table for fast merges of big tables, data.table does not support mixed encodings, and I could not find a way to get back to a uniform encoding.

Any work-around?

EDIT: I thought I could fall back on the base functions, but they don't preserve the encoding either. paste preserves it, but sub, for instance, does not.

 Encoding(paste(' ', Letters))                 # 'unknown'
 Encoding(str_c(' ', Letters))                 # mixed
 Encoding(sub('^ +', '', paste(' ', Letters))) # mixed

Answer 1:

R doesn’t always make it easy to convert between encodings (there is the function iconv for that, but which encodings it accepts is platform-dependent). However, at the very least you can always reset the encoding mark of a string to “unknown”:

Letters = str_trim(Letters)
Encoding(Letters)
# [1] "unknown" "UTF-8"
Encoding(Letters) = ''
Encoding(Letters)
# [1] "unknown" "unknown"

Note, however, that this only changes the encoding mark of the string; it does not re-encode the underlying bytes. As a consequence, it can lead to garbled data. As mentioned in the comments, this is at best a hack, not an actual fix for the problem.
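To make the difference between marking and re-encoding concrete, here is a minimal base-R sketch. The variable names are illustrative, and the string is built from a \u escape so the result does not depend on how the script file itself is encoded.

```r
# "\u00f8" is the letter ø; strings built from \u escapes are marked UTF-8.
x <- "\u00f8"
Encoding(x)                   # "UTF-8"
bytes_before <- charToRaw(x)  # the two UTF-8 bytes c3 b8

# Reset the mark only. R now treats the bytes as being in the native encoding.
y <- x
Encoding(y) <- "unknown"
bytes_after <- charToRaw(y)

# The bytes are unchanged, so nothing was re-encoded.
identical(bytes_before, bytes_after)  # TRUE
```

In a UTF-8 locale y still prints correctly because the native encoding happens to match the bytes; in a Latin-1 locale the same two bytes would be read as two separate characters, which is where the garbling comes from.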

The Encoding function exemplifies R’s difficulty in working properly with encodings. Its documentation says:

ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings.

… which is not helpful at all (and also more than a bit misleading: a UTF-8 string consisting only of code points < 128 may look indistinguishable from an ASCII string, but operating on it should yield different results depending on the encoding, which is why it should effectively be marked).

Interestingly, neither enc2native nor enc2utf8 does the desired thing here: both yield different encoding marks for the two strings in Letters, a direct consequence of the Encoding behavior cited above.
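A small base-R sketch of this behavior, using a stand-in Letters vector built from a \u escape instead of readline input (an assumption made so the example runs non-interactively):

```r
# One ASCII element and one non-ASCII element (ø), the latter marked UTF-8.
Letters <- c("q", "\u00f8")

# enc2utf8 leaves the ASCII element unmarked, so the marks stay mixed.
marks <- Encoding(enc2utf8(Letters))
marks  # "unknown" "UTF-8"

# Explicitly marking an ASCII string has no effect either.
a <- "q"
Encoding(a) <- "UTF-8"
Encoding(a)  # "unknown"
```

The output of enc2native is not shown here because it depends on the locale of the session.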



Answer 2:

stringr is changing the encoding because stringr is a wrapper around the stringi package, and stringi always encodes in UTF-8. See help("stringi-encoding", package = "stringi") for details and an explanation of this design choice.

To avoid problems with merging data.tables, just make sure all the id variable(s) are encoded in UTF-8. You can do that using stri_enc_toutf8 in the stringi package, or using iconv.
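As a rough illustration of normalizing a key column before a merge, the sketch below uses base R's enc2utf8 on plain data frames so that it runs without extra packages; stri_enc_toutf8 from stringi would be used in the same place. The tables left and right and the column id are hypothetical, and the data.table merge itself is not shown.

```r
# The same letter (ø) stored in two different encodings.
key_utf8 <- "\u00f8"             # bytes c3 b8, marked "UTF-8"
key_latin1 <- "\xf8"             # single byte f8
Encoding(key_latin1) <- "latin1" # declare that byte as Latin-1

left  <- data.frame(id = c("q", key_latin1), a = 1:2, stringsAsFactors = FALSE)
right <- data.frame(id = c("q", key_utf8),   b = 3:4, stringsAsFactors = FALSE)

# Re-encode both key columns to UTF-8 (this converts bytes, not just marks).
left$id  <- enc2utf8(left$id)
right$id <- enc2utf8(right$id)

# Both keys now have identical bytes, so they match reliably.
same_bytes <- identical(charToRaw(left$id[2]), charToRaw(right$id[2]))
merged <- merge(left, right, by = "id")
```

ASCII keys such as "q" remain marked "unknown" after the conversion. That is harmless, because their bytes are the same in every supported encoding.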



Answer 3:

With this recent commit, data.table now takes care of these mixed encodings implicitly by ensuring proper encodings while creating data.tables, as well as by ensuring proper encodings in functions like unique() and duplicated().

See news item (23) under bugs for v1.9.7 in README.md.

Please test and write back if you face any further issues.