R: need to replace invisible/accented characters w

Posted 2019-07-04 01:17

I'm working with a file generated by several different machines that had different locale settings, so I ended up with a data frame column containing different spellings of the same word:

CÓRDOBA
CÓRDOBA
CÒRDOBA

I'd like to convert all those to CORDOBA. I've tried doing

t<-gsub("Ó|Ó|Ã’|°|°|Ò","O",t,ignore.case = T) # t is the vector of names

Which works until it finds some "invisible" characters: [screenshot: Invisible Characters]

As you can see, I'm not able to see, in R, the additional character that lies between Ã and \ (if I copy-paste it into MS Word, Word shows it as an empty rectangle). I've tried to dput the vector, but it shows exactly as on screen (without the "invisible" character).

I ran Encoding(t), and it returns unknown for all values.
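For completeness, base R can also dump the underlying bytes or code points of a suspicious value, which makes non-printing characters visible. This is only a generic inspection sketch with a made-up string, not my actual data:

x <- "N\u00b0 08"        # made-up example value, not from my file

charToRaw(x)             # raw bytes of the string as currently stored
utf8ToInt(enc2utf8(x))   # integer Unicode code points after converting to UTF-8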

My system configuration follows:

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Spanish_Colombia.1252  LC_CTYPE=Spanish_Colombia.1252    LC_MONETARY=Spanish_Colombia.1252 LC_NUMERIC=C                     
[5] LC_TIME=Spanish_Colombia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] zoo_1.7-12       dplyr_0.4.2      data.table_1.9.4

loaded via a namespace (and not attached):
 [1] R6_2.1.0        assertthat_0.1  magrittr_1.5    plyr_1.8.3      parallel_3.2.1  DBI_0.3.1       tools_3.2.1     reshape2_1.4.1  Rcpp_0.11.6     stringi_0.5-5  
[11] grid_3.2.1      stringr_1.0.0   chron_2.3-47    lattice_0.20-31

I've used saveRDS to write a file with a data frame of actual and expected toy values, which can be read back with readRDS from here. I'm not absolutely sure it will load with the same problems I have (depending on your locale), but I hope it does, so you can provide some help.

In the end, I'd like to convert all those special characters to unaccented ones (Ó to O, etc.), hopefully without having to manually enter each one of the special characters into a regex (in other words, I'd like, if possible, some sort of gsub("[:weird:]","[:equivalentToWeird:]",t)). If that's not possible, I'd at least like to be able to find (and replace) those "invisible" characters.
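As an aside, when strings are already correctly encoded, stringi offers a generic ICU transliterator that strips accents without listing each character by hand; the sketch below shows that generic case only, and would not by itself untangle the double-encoded sequences in my file:

library(stringi)

x <- c("CÓRDOBA", "CÒRDOBA")          # correctly encoded examples

stri_trans_general(x, "Latin-ASCII")  # ICU transform: gives "CORDOBA" "CORDOBA"
iconv(x, to = "ASCII//TRANSLIT")      # base-R alternative; output depends on the platform's iconv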

Thanks,

############## EDIT TO ADD ###################

If I run the following code:

library(stringi)
d <- readRDS("c:/path/to/downloaded/Dropbox/file/inv_char.Rdata")
stri_escape_unicode(d$actual)

This is what I get:

[1] "\\u00c3\\u201cN  N\\u00c2\\u00b0 08 \\\"CACIQUE CALARC\\u00c3\\u0081\\\" - ARMENIA"
[2] "\\u00d3N  N\\u00b0 08 \\\"CACIQUE CALARC\\u00c1\\\" - ARMENIA"                     
[3] "\\u00d3N  N\\u00b0 08 \\\"CACIQUE CALARC\\u00c1\\\" - ARMENIA(ALTERNO)" 

Normal output is:

> d$actual
[1] ÓN  N° 08 "CACIQUE CALARCÃ" - ARMENIA       ÓN  N° 08 "CACIQUE CALARCÁ" - ARMENIA          ÓN  N° 08 "CACIQUE CALARCÁ" - ARMENIA(ALTERNO)

1 Answer

叼着烟拽天下 · answered 2019-07-04 01:43

With the help of @hadley, who pointed me towards stringi, I ended up discovering the offending characters and replacing them. This was my initial attempt:

library(stringi)

unweird <- function(t) {
    # escape everything to literal "\uXXXX" sequences so the offending
    # characters become visible and matchable with plain regexes
    t <- stri_escape_unicode(t)
    # map each family of escaped sequences to its unaccented letter
    t <- gsub("\\\\u00c3\\\\u0081|\\\\u00c1", "A", t)
    t <- gsub("\\\\u00c3\\\\u02c6|\\\\u00c3\\\\u2030|\\\\u00c9|\\\\u00c8", "E", t)
    t <- gsub("\\\\u00c3\\\\u0152|\\\\u00c3\\\\u008d|\\\\u00cd|\\\\u00cc", "I", t)
    t <- gsub("\\\\u00c3\\\\u2019|\\\\u00c3\\\\u201c|\\\\u00c2\\\\u00b0|\\\\u00d3|\\\\u00b0|\\\\u00d2|\\\\u00ba|\\\\u00c2\\\\u00ba", "O", t)
    t <- gsub("\\\\u00c3\\\\u2018|\\\\u00d1", "N", t)
    # intended to drop (escaped) non-breaking spaces
    t <- gsub("\\u00a0|\\u00c2\\u00a0", "", t)
    t <- gsub("\\\\u00f3", "o", t)
    # turn the remaining escaped sequences back into real characters
    stri_unescape_unicode(t)
}

which produced the expected result. I was a little bit curious about other stringi functions, so I wondered whether its own replacement function could be faster on my 3.3 million rows. I then tried stri_replace_all_regex like this:

stri_unweird <- function(t) {
  # same mappings as unweird(), but in a single vectorised call: each
  # pattern is paired with the replacement at the same position in the
  # second vector, and vectorize_all = FALSE applies the pairs one
  # after another to every element of t
  stri_unescape_unicode(stri_replace_all_regex(stri_escape_unicode(t),
    c("\\\\u00c3\\\\u0081|\\\\u00c1",
      "\\\\u00c3\\\\u02c6|\\\\u00c3\\\\u2030|\\\\u00c9|\\\\u00c8",
      "\\\\u00c3\\\\u0152|\\\\u00c3\\\\u008d|\\\\u00cd|\\\\u00cc",
      "\\\\u00c3\\\\u2019|\\\\u00c3\\\\u201c|\\\\u00c2\\\\u00b0|\\\\u00d3|\\\\u00b0|\\\\u00d2|\\\\u00ba|\\\\u00c2\\\\u00ba",
      "\\\\u00c3\\\\u2018|\\\\u00d1",
      "\\u00a0|\\u00c2\\u00a0",
      "\\\\u00f3"),
    c("A", "E", "I", "O", "N", "", "o"),
    vectorize_all = FALSE))
}
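A quick way to sanity-check the result on the toy data frame saved above (the path is a placeholder, and I'm assuming the hand-cleaned targets sit in a column called expected, as described earlier):

d <- readRDS("inv_char.Rdata")               # toy data frame with columns actual and expected

all(stri_unweird(d$actual) == d$expected)    # TRUE if every offending sequence was mapped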

As a side note, I ran microbenchmark on both methods; these are the results:

library(microbenchmark)

g <- microbenchmark(unweird(t), stri_unweird(t), times = 100L)
summary(g)
             expr      min       lq     mean   median       uq      max neval cld
1      unweird(t) 423.0083 425.6400 431.9609 428.1031 432.6295 490.7658   100   b
2 stri_unweird(t) 118.5831 119.5057 121.2378 120.3550 121.8602 138.3111   100  a 