encoding error with read_html

2019-03-02 08:29发布

问题:

I am trying to web scrape a page. I thought of using the package rvest. However, I'm stuck in the first step, which is to use read_html to read the content. Here´s my code:

library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
obra_caridade <- read_html(url,
                        encoding = "ISO-8895-1")

And I got the following error:

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  Input is not proper UTF-8, indicate encoding !
Bytes: 0xE3 0x6F 0x20 0x65 [9]

I tried using what similar questions had as answers, but it did not solve my issue:

obra_caridade <- read_html(iconv(url, to = "UTF-8"),
                        encoding = "UTF-8")

obra_caridade <- read_html(iconv(url, to = "ISO-8895-1"),
                        encoding = "ISO-8895-1")

Both attempts returned a similar error. Does anyone has any suggestion about how to solve this issue? Here's my session info:

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rvest_0.3.2 xml2_1.1.1 

loaded via a namespace (and not attached):
[1] httr_1.2.1   magrittr_1.5 R6_2.2.1     tools_3.3.1  curl_2.6     Rcpp_0.12.11