I'm trying to parse statements from the Mexican Senate, but I'm having trouble with the UTF-8 encoding of the web pages.
This page comes through cleanly:
library(rvest)
Senate<-html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/19675-version-estenografica-de-la-reunion-ordinaria-de-las-comisiones-unidas-de-puntos-constitucionales-de-anticorrupcion-y-participacion-ciudadana-y-de-estudios-legislativos-segunda.html")
Here is an example of a bit of the webpage:
"CONTINÚA EL SENADOR CORRAL JURADO: Nosotros decimos. Entonces, bueno, el tema es que hay dos rutas señor presidente y también tratar, por ejemplo, de forzar ahora. Una decisión de pre dictamen a lo mejor lo único que va a hacer es complicar más las cosas."
As can be seen, both the accents and the "ñ" come through fine.
The problem arises with some other pages (on the same domain!). For example:
Senate2<-html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html")
I get:
"-EL C. DIPUTADO ADAME ALEMÃÂN: En consecuencia está a discusión la propuesta. Y para hablar sobre este asunto, se le concede el uso de la palabra a la senadora…….."
On this second page I've tried iconv() and setting encoding="UTF-8" in html(), but I keep getting the same results.
I've also checked the page's encoding with the W3C Validator; it reports UTF-8 and no issues.
Using gsub() does not seem workable, because different characters come through as the same garbled sequence:
í - ÃÂ
á - ÃÂ
ó - ÃÂ
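To show why I don't think a gsub() on the visible part can tell these apart, here is a minimal sketch in base R. It rests on an assumption: that the garbling comes from UTF-8 bytes being read one byte at a time. In UTF-8, í, á and ó all start with the same lead byte and differ only in the second byte, so if that second byte is lost or not displayed, the three look identical. The vector name `chars` is just for illustration.

```r
# The three characters from the list above, written as escapes so the
# example does not depend on the encoding of this script
chars <- c("\u00ed", "\u00e1", "\u00f3")  # í, á, ó

# Raw UTF-8 bytes of each character
bytes <- lapply(enc2utf8(chars), charToRaw)
names(bytes) <- c("i_acute", "a_acute", "o_acute")
print(bytes)

# All three share the same first byte ...
first_bytes <- vapply(bytes, function(b) as.integer(b[1]), integer(1))
# ... and differ only in the second byte
second_bytes <- vapply(bytes, function(b) as.integer(b[2]), integer(1))
```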
I'm pretty much out of ideas.
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] grDevices utils datasets graphics stats grid methods base
other attached packages:
[1] stringi_0.4-1 magrittr_1.5 selectr_0.2-3 rvest_0.2.0 ggplot2_1.0.0 geosphere_1.3-11 fields_7.1
[8] maps_2.3-9 spam_1.0-1 sp_1.0-17 SOAR_0.99-11 data.table_1.9.4 reshape2_1.4.1 xlsx_0.5.7
[15] xlsxjars_0.6.1 rJava_0.9-6
loaded via a namespace (and not attached):
[1] bitops_1.0-6 chron_2.3-45 colorspace_1.2-4 digest_0.6.8 evaluate_0.5.5 formatR_1.0 gtable_0.1.2
[8] httr_0.6.1 knitr_1.8 lattice_0.20-29 MASS_7.3-35 munsell_0.4.2 plotly_0.5.17 plyr_1.8.1
[15] proto_0.3-10 Rcpp_0.11.3 RCurl_1.95-4.5 RJSONIO_1.3-0 scales_0.2.4 stringr_0.6.2 tools_3.1.2
[22] XML_3.98-1.1
UPDATE: This seems to be the issue:
stri_enc_mark(Senate2)
[1] "ASCII" "latin1" "latin1" "ASCII" "ASCII" "latin1" "ASCII" "ASCII" "latin1"
... and so forth. The problem seems to be in the elements marked latin1:
stri_enc_isutf8(texto2)
[1] TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
How can I coerce the latin1 strings to correct UTF-8? When they are "translated" by stringi, the conversion appears to go wrong and gives me the problems described earlier.
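For reference, this is roughly the round trip I would expect to undo the damage. It is only a sketch on a made-up string, not text from the page. It assumes the text was UTF-8 that got read as latin1 and re-encoded as UTF-8 exactly once, and I don't know that this is what actually happens here (there may be more than one layer). The names `original`, `garbled` and `repaired` are just for illustration.

```r
# Made-up sample string (not taken from the Senate page)
original <- enc2utf8("se\u00f1or")  # "señor"

# Simulate the suspected damage: treat the UTF-8 bytes as latin1
# and re-encode them as UTF-8
garbled <- iconv(original, from = "latin1", to = "UTF-8")

# Attempted repair: convert back to single bytes, then declare
# those bytes to be UTF-8 again
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"

# Compare at the byte level so console display issues don't matter
same_bytes <- identical(charToRaw(repaired), charToRaw(original))
print(same_bytes)
```

If the real text has been through this more than once, the repair step would presumably have to be repeated, which I have not verified.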