I recently discovered the rvest package in R and decided to try out some web scraping.
I wrapped a small web crawler in a function so I could pipe its output into further cleaning steps, etc.
With a small URL list (e.g. 1-100 URLs) the function works fine; with a larger list, however, it hangs at some point. It seems as though one of the commands is waiting for a response that never arrives, yet no error is raised.
urlscrape <- function(url_list) {
  library(rvest)
  library(dplyr)
  assets      <- NA
  price       <- NA
  description <- NA
  city        <- NA
  n  <- length(url_list)
  pb <- txtProgressBar(min = 0, max = n, style = 3)
  for (i in 1:n) {
    #scraping for price#
    try({read_html(url_list[i]) %>% html_node(".price span") %>% html_text() -> price[i]}, silent = TRUE)
    #scraping for city#
    try({read_html(url_list[i]) %>% html_node(".city") %>% html_text() -> city[i]}, silent = TRUE)
    #scraping for description#
    try({read_html(url_list[i]) %>% html_nodes("h1") %>% html_text() %>% paste(collapse = " ") -> description[i]}, silent = TRUE)
    #scraping for assets#
    try({read_html(url_list[i]) %>% html_nodes(".assets>li") %>% html_text() %>% paste(collapse = " ") -> assets[i]}, silent = TRUE)
    Sys.sleep(2)
    setTxtProgressBar(pb, i)
  }
  time <- Sys.time()
  print("")
  paste("Finished at", time) %>% print()
  print("")
  return(as.data.frame(cbind(price, city, description, assets)))
}
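For context, I call it roughly like this (the URLs below are placeholders, not the real site):

urls   <- paste0("https://www.example.com/listing/", 1:100)   # placeholder URLs
result <- urlscrape(urls)
result %>% filter(!is.na(price)) %>% head()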
(1) Without knowing the exact cause, I looked for a timeout option in the rvest package, to no avail. I then tried to use the timeout option from the httr package (with the console still hanging as a result). For ".price" the line would become:
try( {content(GET(url_list[i], timeout=(10)), timeout=(10), as="text") %>% read_html() %>% html_node(".price span") %>% html_text() -> price[i]}, silent=TRUE)
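If I read the httr documentation correctly, timeout() is a config object that should be passed to GET() unnamed rather than as a timeout= argument, so the attempt would presumably look more like the sketch below (library(httr) loaded; only a sketch, not verified on the full list):

library(httr)
try({
  GET(url_list[i], timeout(10)) %>%                 # give up on the request after 10 seconds
    content(as = "text") %>%
    read_html() %>%
    html_node(".price span") %>%
    html_text() -> price[i]
}, silent = TRUE)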
I thought of other solutions and tried to implement them, but they did not work either.
(2) A time limit with setTimeLimit():
n  <- length(url_list)
pb <- txtProgressBar(min = 0, max = n, style = 3)
setTimeLimit(elapsed = 20)
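My understanding of setTimeLimit() is that a single call before the loop is probably not enough, and that the limit would have to be re-armed for every iteration with transient = TRUE, roughly like this (a sketch, not verified on the long list):

for (i in 1:n) {
  setTimeLimit(elapsed = 20, transient = TRUE)   # fresh 20-second limit for this iteration
  try({read_html(url_list[i]) %>% html_node(".price span") %>% html_text() -> price[i]}, silent = TRUE)
  # ...the other three try() blocks as above...
  setTimeLimit(elapsed = Inf)                    # clear the limit again
  Sys.sleep(2)
  setTxtProgressBar(pb, i)
}

Even then, as far as I can tell the limit is only checked when R reaches an interruption point, so a call that blocks inside the underlying connection code might never see it.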
(3) Testing for URL success with url_success(), with a counter c that increases after the fourth scrape:
for (i in 1:n) {
  while (url_success(url_list[i]) == TRUE & c == i) {
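Spelled out, the idea was roughly the following (c starts at 1 and is bumped once the four scrapes for a URL are done, so the while() runs at most once per URL and is skipped when the URL does not respond successfully):

c <- 1
for (i in 1:n) {
  while (url_success(url_list[i]) == TRUE & c == i) {   # url_success() comes from httr
    # ...the four try() scrapes from the original loop...
    c <- c + 1                                          # advance the counter so the while() exits
  }
  Sys.sleep(2)
  setTxtProgressBar(pb, i)
}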
None of these worked, and the function still hangs when the URL list is big. Question: why would the console hang, and how could it be solved? Thanks for reading.