UTF-8 encoding problems with R

Posted 2019-03-31 20:44

Question:

Trying to parse Senate statements from the Mexican Senate, but having trouble with UTF-8 encodings of the web page.

This html comes through clearly:

library(rvest)
Senate <- html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/19675-version-estenografica-de-la-reunion-ordinaria-de-las-comisiones-unidas-de-puntos-constitucionales-de-anticorrupcion-y-participacion-ciudadana-y-de-estudios-legislativos-segunda.html")

Here is an example of a bit of the webpage:

"CONTINÚA EL SENADOR CORRAL JURADO: Nosotros decimos. Entonces, bueno, el tema es que hay dos rutas señor presidente y también tratar, por ejemplo, de forzar ahora.   Una decisión de pre dictamen a lo mejor lo único que va a hacer es complicar más las cosas."

As can be seen, both accents and the "ñ" come through fine.

The issue arises with some other pages (on the same domain!). For example:

Senate2 <- html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html")

I get:

 "-EL C. DIPUTADO ADAME ALEMÃÂN: En consecuencia está a discusión la propuesta. Y para hablar sobre este asunto, se le concede el uso de la palabra a la senadora…….."

For this second page I've tried iconv() and forcing the encoding argument of html() to encoding = "UTF-8", but I keep getting the same results.
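Roughly, those attempts look like this (a sketch; texto2 stands for the text vector extracted from Senate2):

# Forcing the encoding at parse time...
Senate2 <- html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html",
                encoding = "UTF-8")
# ...and re-encoding the extracted text after the fact
texto2 <- iconv(texto2, from = "latin1", to = "UTF-8")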

I've also checked the page's encoding with the W3C Validator; it reports UTF-8 with no issues.

Using gsub does not seem workable either, since different characters get downloaded as the same garbled "code":

í - ÃÂ
á - ÃÂ
ó - ÃÂ

Pretty much fresh out of ideas.

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] grDevices utils     datasets  graphics  stats     grid      methods   base     

other attached packages:
 [1] stringi_0.4-1    magrittr_1.5     selectr_0.2-3    rvest_0.2.0      ggplot2_1.0.0    geosphere_1.3-11 fields_7.1      
 [8] maps_2.3-9       spam_1.0-1       sp_1.0-17        SOAR_0.99-11     data.table_1.9.4 reshape2_1.4.1   xlsx_0.5.7      
[15] xlsxjars_0.6.1   rJava_0.9-6     

loaded via a namespace (and not attached):
 [1] bitops_1.0-6     chron_2.3-45     colorspace_1.2-4 digest_0.6.8     evaluate_0.5.5   formatR_1.0      gtable_0.1.2    
 [8] httr_0.6.1       knitr_1.8        lattice_0.20-29  MASS_7.3-35      munsell_0.4.2    plotly_0.5.17    plyr_1.8.1      
[15] proto_0.3-10     Rcpp_0.11.3      RCurl_1.95-4.5   RJSONIO_1.3-0    scales_0.2.4     stringr_0.6.2    tools_3.1.2     
[22] XML_3.98-1.1    

UPDATE: This seems to be the issue:

stri_enc_mark(Senate2)
[1] "ASCII"  "latin1" "latin1" "ASCII"  "ASCII"  "latin1" "ASCII"  "ASCII"  "latin1"

... and so forth. Clearly, the problem is in the latin1 strings:

stri_enc_isutf8(texto2)
    [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE

How can I coerce the latin1 strings to correct UTF-8? When "translated" by stringi, it appears to do it wrong, giving me the issues described earlier.
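For reference, by "translated" I mean something along these lines (a sketch; texto2 is the extracted text from above):

library(stringi)
# Re-encode only the elements marked latin1, leaving the ASCII ones alone
marks <- stri_enc_mark(texto2)
texto2[marks == "latin1"] <- stri_encode(texto2[marks == "latin1"],
                                         from = "latin1", to = "UTF-8")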

Answer 1:

Encodings are one of the 21st century's worst headaches. But here's a solution for you:

# Set-up remote reading connection, specifying UTF-8 as encoding.
addr <- "http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html"
read.html.con <- file(description = addr, encoding = "UTF-8", open = "rt")

# Read in cycles of 1000 characters; readChar() returns character(0) at EOF,
# so nothing gets appended, the condition fails and the loop ends
html.text <- c()
i <- 0
while (length(html.text) == i) {
    html.text <- append(html.text, readChar(con = read.html.con, nchars = 1000))
    cat(i <- i + 1)
}

# close reading connection
close(read.html.con)

# Paste everything back together & at the same time, convert from UTF-8
# to... UTF-8 with iconv(). I know. It's crazy. Encodings are secretly
# meant to drive us insane.
content <- paste0(iconv(html.text, from = "UTF-8", to = "UTF-8"), collapse = "")

# Set-up local writing
outpath <- "~/htmlfile.html"

# Create file connection specifying "UTF-8" as encoding, once more
# (Although this one makes sense)
write.html.con <- file(description = outpath, open = "w", encoding = "UTF-8")

# Use capture.output to dump everything back into the html file
# Using cat inside it will prevent having [1]'s, quotes and such parasites
capture.output(cat(content), file = write.html.con)

# Close the output connection
close(write.html.con)

Then you're ready to open your newly created file in your favorite browser. You should see it intact, ready to be reopened with the tools of your choosing!
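For instance, a minimal sketch of re-reading the saved copy with the rvest version used above (in current rvest the equivalent function is read_html()):

# Re-parse the local, cleaned-up copy, again forcing UTF-8
Senate2 <- html("~/htmlfile.html", encoding = "UTF-8")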



Answer 2:

I think I have an idea of what Dominic's twist does. See a related topic here, answered by Hadley.

Your problem is almost surely that the UTF-8 file comes with a BOM (byte-order mark). Support for BOMs was introduced in R 3.0.0, and many packages still do not handle them. The usual workaround has been to save the data to a text file, open it in a program that handles BOMs, such as Windows Notepad or OpenOffice Calc, resave it, and then reopen it. A dirty trick, but it can now be done reproducibly, as the base R read.table / read.csv family handles the issue explicitly:

read.csv(..., fileEncoding = "UTF-8-BOM")    

I think Dominic's trick is related to this. Some people say that the UTF-8 BOM is a legacy issue that will fade away, but I doubt it, so it would be great if there were more explicit ways to address the problem.
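One explicit option, if the text is already in R, is to strip the BOM yourself: when a UTF-8 BOM leaks into a string it shows up as a leading "\ufeff" (a sketch, with x standing for the character vector read from the file):

# Drop a leading byte-order mark from the first element, if present
x[1] <- sub("^\ufeff", "", x[1])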

You can always check whether a garbled UTF-8 file opens correctly in OpenOffice Calc (or, on Windows, in Notepad), or whether it survives a round trip through write.csv / read.csv or other text write/read functions.
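Such a round trip might look like this (a sketch; texto2 is the text vector in question):

# Write out as UTF-8, then read back while telling R to expect a BOM
write.csv(data.frame(text = texto2), "check.csv", fileEncoding = "UTF-8",
          row.names = FALSE)
check <- read.csv("check.csv", fileEncoding = "UTF-8-BOM",
                  stringsAsFactors = FALSE)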