There is a strange behavior of stringr that is really annoying me. stringr changes, without a warning, the encoding of some strings that contain non-ASCII characters, in my case ø, å, æ, é and some others. If you str_trim a character vector, the elements with non-ASCII letters come back marked with a different Encoding.
library(stringr)
letter1 <- readline('Gimme an ASCII character!') # try q or a
letter2 <- readline('Gimme a non-ASCII character!') # try ø or é
Letters <- c(letter1, letter2)
Encoding(Letters) # 'unknown'
Encoding(str_trim(Letters)) # mixed 'unknown' and 'UTF-8'
This is a problem because I use data.table for (fast) merges of big tables, data.table does not support mixed encodings, and I could not find a way to get back to a uniform encoding.
Any work-around?
EDIT: I thought I could fall back on the base functions, but they don't preserve the encoding either. paste conserves it, but sub, for instance, does not.
Encoding(paste(' ', Letters)) # 'unknown'
Encoding(str_c(' ', Letters)) # mixed
Encoding(sub('^ +', '', paste(' ', Letters))) # mixed
R doesn’t always make it easy to convert between encodings (there’s the function iconv for that, but which encodings it accepts is platform dependent). However, at the very least you can always reset the encoding mark of a string to “unknown”:
Letters = str_trim(Letters)
Encoding(Letters)
# [1] "unknown" "UTF-8"
Encoding(Letters) = ''
Encoding(Letters)
# [1] "unknown" "unknown"
However, note that this only changes the encoding mark of a string; it doesn’t actually re-encode the string. As a consequence, this can lead to garbled data. As mentioned in the comments, this is at best a hack, not an actual fix for the problem.
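To make the difference between marking and re-encoding concrete, here is a minimal sketch in base R. It assumes a string built with a \u escape (which R marks as UTF-8) instead of the readline input from the question, and the variable names are only illustrative.

```r
# A string literal containing a \u escape is marked as UTF-8.
x <- "\u00f8"
bytes_before <- charToRaw(x)   # the UTF-8 bytes of the letter

# Reset the mark, as done above with Encoding(Letters) = ''.
Encoding(x) <- ""
bytes_after <- charToRaw(x)

Encoding(x)                           # "unknown"
identical(bytes_before, bytes_after)  # TRUE: only the mark changed
```

The bytes stay UTF-8, so if the session's native encoding is not UTF-8, R will now interpret them in the wrong encoding, which is where the garbling comes from.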
Encoding exemplifies R’s trouble working properly with encodings. The documentation says:
ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings.
… which is obviously not helpful at all (and also more than a bit misleading; a UTF-8 string consisting only of code points < 128 may look indistinguishable from an ASCII string, but operating on it should yield different results depending on encoding, which is why it should effectively be marked).
Interestingly, neither enc2native nor enc2utf8 will do the desired thing here; both will yield different encoding marks for the two strings in Letters, a direct consequence of the Encoding problem cited above.
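A small sketch of that behavior for enc2utf8, assuming a hand-built vector (the name letters_mixed is hypothetical) in place of the interactive Letters from the question:

```r
# One pure-ASCII string and one non-ASCII string marked as UTF-8.
letters_mixed <- c("q", "\u00f8")

# enc2utf8 converts where needed, but ASCII strings are never
# given a declared encoding, so the marks stay mixed.
converted <- enc2utf8(letters_mixed)
Encoding(converted)   # "unknown" "UTF-8"
```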
stringr is changing the encoding because it is a wrapper around the stringi package, and stringi always encodes its output in UTF-8. See help("stringi-encoding", package = "stringi") for details and an explanation of this design choice.
To avoid problems with merging data.tables, just make sure all the id variable(s) are encoded in UTF-8. You can do that using stri_enc_toutf8 in the stringi package, or using iconv.
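As a rough illustration of normalizing keys before a merge, the sketch below uses base R's enc2utf8 rather than stri_enc_toutf8 or iconv, purely so that it runs without extra packages. The key vector and its names are made up for the example; it assumes the same letter arrives once as Latin-1 and once as UTF-8.

```r
# The same letter in two encodings, each correctly marked.
oe_utf8 <- "\u00f8"
oe_latin1 <- "\xf8"
Encoding(oe_latin1) <- "latin1"

# Hypothetical id column with mixed encodings.
keys <- c("q", oe_latin1, oe_utf8)

# Convert every element to UTF-8 before using it as a merge key.
keys_utf8 <- enc2utf8(keys)

# Both copies of the letter now share the same byte representation.
same_bytes <- identical(charToRaw(keys_utf8[2]), charToRaw(keys_utf8[3]))
same_bytes
```

This only works when the original marks are correct; strings whose bytes are Latin-1 but whose mark is "unknown" would need an explicit iconv call with the right source encoding.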
With this recent commit, data.table now takes care of these mixed encodings implicitly by ensuring proper encodings while creating data.tables, as well as by ensuring proper encodings in functions like unique() and duplicated().
See news item (23) under bugs for v1.9.7 in README.md.
Please test and write back if you face any further issues.