Comatose web crawler in R (w/ rvest)

Posted 2019-04-13 11:50

Question:

I recently discovered the rvest package in R and decided to try out some web scraping.

I wrote a small web crawler as a function so I could pipe its output into further cleaning steps etc.

With a small URL list (e.g. 1-100) the function works fine, but with a larger list it hangs at some point. It seems as if one of the commands is waiting for a response that never arrives, yet no error is raised.

urlscrape <- function(url_list) {
  library(rvest)
  library(dplyr)
  assets <- NA
  price <- NA
  description <- NA
  city <- NA
  n <- length(url_list)
  pb <- txtProgressBar(min = 0, max = n, style = 3)

  for (i in 1:n) {
    #scraping for price#
    try({read_html(url_list[i]) %>% html_node(".price span") %>% html_text() -> price[i]}, silent = TRUE)

    #scraping for city#
    try({read_html(url_list[i]) %>% html_node(".city") %>% html_text() -> city[i]}, silent = TRUE)

    #scraping for description#
    try({read_html(url_list[i]) %>% html_nodes("h1") %>% html_text() %>% paste(collapse = " ") -> description[i]}, silent = TRUE)

    #scraping for assets#
    try({read_html(url_list[i]) %>% html_nodes(".assets>li") %>% html_text() %>% paste(collapse = " ") -> assets[i]}, silent = TRUE)

    Sys.sleep(2)
    setTxtProgressBar(pb, i)
  }

  time <- Sys.time()
  print("")
  paste("Finished at", time) %>% print()
  print("")
  return(as.data.frame(cbind(price, city, description, assets)))
}

(1) Without knowing the exact cause, I looked for a timeout option in the rvest package, to no avail. I then tried the timeout option in the httr package, but the console still hung. For ".price" the call became:

try( {content(GET(url_list[i], timeout(10)), as="text") %>% read_html() %>% html_node(".price span") %>% html_text()->price[i]}, silent=TRUE)
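
For context, a sketch of how that httr timeout could be wired through the whole loop, fetching each page once and reusing the parsed document for all four selectors (the single fetch per URL and the 10-second value are assumptions, not part of the original function; the vectors and progress bar are the ones set up in the function above):

library(httr)
library(rvest)
library(dplyr)

# Sketch only: fetch each page once with a 10-second timeout. A timed-out or
# failed GET() becomes a try-error and the fields for that URL simply stay NA.
for (i in 1:n) {
  page <- try(read_html(content(GET(url_list[i], timeout(10)), as = "text", encoding = "UTF-8")),
              silent = TRUE)
  if (!inherits(page, "try-error")) {
    price[i]       <- page %>% html_node(".price span") %>% html_text()
    city[i]        <- page %>% html_node(".city") %>% html_text()
    description[i] <- page %>% html_nodes("h1") %>% html_text() %>% paste(collapse = " ")
    assets[i]      <- page %>% html_nodes(".assets>li") %>% html_text() %>% paste(collapse = " ")
  }
  Sys.sleep(2)
  setTxtProgressBar(pb, i)
}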

I came up with other possible solutions and tried to implement them, but they did not work either.

(2) Setting a time limit with setTimeLimit:

length(url_list)->n
pb <- txtProgressBar(min = 0, max = n, style = 3)
setTimeLimit(elapsed=20)
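
One likely issue with (2): setTimeLimit() budgets the elapsed time of the whole top-level computation rather than a single request, and it is only checked where R can process an interrupt, so a download blocked inside a system call may never trigger it. A sketch of arming a transient limit per request instead (still not a guaranteed cure; it assumes the vectors and progress bar from the function above):

library(rvest)

# Sketch: arm a transient 20 s limit for each request and disarm it afterwards.
for (i in 1:n) {
  page <- tryCatch({
    setTimeLimit(elapsed = 20, transient = TRUE)
    read_html(url_list[i])
  }, error = function(e) NULL)
  setTimeLimit(elapsed = Inf)   # reset the limit for the rest of the run

  if (!is.null(page)) {
    price[i] <- page %>% html_node(".price span") %>% html_text()
  }
  setTxtProgressBar(pb, i)
}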

(3) Testing for URL success, with c increasing after the 4th scrape:

for (i in 1:n) {
        while(url_success(url_list[i])==TRUE & c==i) {
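
For illustration, one way the idea in (3) could be completed is as a pre-check that skips URLs which do not answer successfully (url_success() is the older httr helper named above; newer httr spells the same check !http_error()). Note that the pre-check is itself an HTTP request, so it can hang for the same reason as the scrape:

library(httr)
library(rvest)

# Sketch only: skip unresponsive pages and leave their fields NA.
for (i in 1:n) {
  ok <- try(url_success(url_list[i]), silent = TRUE)
  if (!isTRUE(ok)) next
  try({read_html(url_list[i]) %>% html_node(".price span") %>% html_text() -> price[i]}, silent = TRUE)
}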

None of these approaches worked, so the function still hangs when the URL list is large. Question: why would the console hang, and how could it be solved? Thanks for reading.

Answer 1:

Unfortunately, none of the above solutions worked for me. Some URLs freeze the R script, whether they are fetched with read_html(..) from rvest, GET(..) from httr, or getURL(..) / getURLContent(..) from RCurl.

The only solution that worked for me is a combination of evalWithTimeout from R.utils (called withTimeout in current versions of R.utils) and a tryCatch block:

# install.packages("R.utils")
# install.packages("rvest")
library(R.utils)
library(rvest)
pageIsBroken = FALSE

url = "http://www.detecon.com/de/bewerbungsformular?job-title=berater+f%c3%bcr+%e2%80%9cdigital+transformation%e2%80%9d+(m/w)"

page = tryCatch(

  evalWithTimeout({ read_html(url, encoding="UTF-8") }, timeout = 5),

  error = function(e) {
    pageIsBroken <<- TRUE; 
    return(e)
  }
)

if (pageIsBroken) {
  print(paste("Error Msg:", toString(page)))
}
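
Applied to the crawler in the question, the same pattern can be packaged as a small wrapper; a sketch, where the helper name read_html_timed and the 5-second budget are illustrative choices, not an existing function:

library(R.utils)
library(rvest)

# Hypothetical helper: read_html() with a hard time budget. Pages that time
# out or error come back as NULL, so a calling loop can simply skip them.
read_html_timed <- function(url, seconds = 5) {
  tryCatch(
    evalWithTimeout(read_html(url, encoding = "UTF-8"), timeout = seconds),
    error = function(e) NULL
  )
}

# e.g. inside the loop from the question:
# page <- read_html_timed(url_list[i])
# if (!is.null(page)) price[i] <- page %>% html_node(".price span") %>% html_text()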




Answer 2:

A simple workaround for your problem is to repeat the HTTP request until you receive a successful response from the server:

for (i in 1:n) {
  repeat {
    html <- try(read_html(url_list[i]), silent = TRUE)
    if (!inherits(html, "try-error")) break
  }
  html %>% html_node(".price span") %>% html_text() -> price[i]
}
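
One caveat, as a hedged variation: the repeat loop above never gives up, so a permanently dead URL would retry forever. Capping the attempts (the limit of 3 tries and the short pause are my choices) keeps the crawl moving and leaves the field NA on failure:

library(rvest)

# Sketch: retry each URL at most 3 times, pausing briefly between attempts.
for (i in 1:n) {
  html <- NULL
  for (attempt in 1:3) {
    html <- try(read_html(url_list[i]), silent = TRUE)
    if (!inherits(html, "try-error")) break
    Sys.sleep(2)
  }
  if (!inherits(html, "try-error")) {
    html %>% html_node(".price span") %>% html_text() -> price[i]
  }
}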


Answer 3:

I encountered the same problem (read_html stalling on some web pages). In my case, fetching the page with RCurl's getURL helped. In combination with the previous answer, you could try this:

library(RCurl)

repeat {
  rawhtml <- try(getURL(link[i], .encoding = "ISO-8859-1", .mapUnicode = FALSE), silent = TRUE)
  if (!inherits(rawhtml, "try-error")) break
}

html <- read_html(rawhtml)
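
If getURL() itself stalls on some hosts, libcurl's own timeouts can be passed through as curl options (connecttimeout and timeout, in seconds, are standard libcurl options that RCurl forwards); a sketch with illustrative 10-second values:

library(RCurl)
library(rvest)

# Sketch: let libcurl give up on its own after 10 seconds, so a stalled fetch
# becomes a try-error instead of a hung console. A retry cap as in the
# previous answer could be added on top.
repeat {
  rawhtml <- try(getURL(link[i], .encoding = "ISO-8859-1", .mapUnicode = FALSE,
                        connecttimeout = 10, timeout = 10),
                 silent = TRUE)
  if (!inherits(rawhtml, "try-error")) break
}
html <- read_html(rawhtml)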