I'm scraping this website using the "rvest" package. When I iterate my function too many times, I get "Error in open.connection(x, "rb") : Timeout was reached". I have searched for similar questions, but the answers seem to lead to dead ends. I suspect that it is server-side and that the website has a built-in restriction on how many times I can visit the page. How do I investigate this hypothesis?
The code: I have the links to the underlying web pages and want to construct a data frame with the information extracted from the associated pages. I have simplified my scraping function a bit, as the problem still occurs with a simpler function:
library(rvest)
library(stringr)

scrape_test <- function(link) {
  # The id and semester are the 5th and 6th parts of the URL
  slit <- str_split(link, "/") %>%
    unlist()
  id <- slit[5]
  sem <- slit[6]
  # Read the page and extract the course name from the h2 nodes
  name <- link %>%
    read_html(encoding = "UTF-8") %>%
    html_nodes("h2") %>%
    html_text() %>%
    str_replace_all("\r\n", "") %>%
    str_trim()
  return(data.frame(id, sem, name))
}
I use map_df() from the purrr package to iterate the function over the links:
library(purrr)

test.data <- links %>%
  map_df(scrape_test)
Now, if I iterate the function using only 50 links, I receive no error. But when I increase the number of links, I encounter the aforementioned error. Furthermore, I get the following warnings:
- "In bind_rows_(x, .id) : Unequal factor levels: coercing to character"
- "closing unused connection 4 (link)"
EDIT: The following code, which creates an object of links, can be used to reproduce my results:
links <- c(rep("http://karakterstatistik.stads.ku.dk/Histogram/NMAK13032E/Winter-2013/B2", 100))
With large scraping tasks, I would usually do a for-loop, which helps with troubleshooting. Create an empty list for your output, for example (the name d below is just a placeholder):
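d <- vector("list", length(links))  # pre-sized list to hold one result per link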
Here I do a for-loop with a tryCatch block, so that if the output is an error, we wait a couple of seconds and try again. We also include a counter that moves on to the next link if we're still getting an error after five attempts. In addition, we check if (!(links[i] %in% names(d))) so that if we have to break the loop, we can skip the links we've already scraped when we restart it.
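A sketch of such a loop, reconstructed from the description above (the progress messages and the exact retry mechanics are illustrative choices, not fixed rules):

for (i in seq_along(links)) {
  if (!(links[i] %in% names(d))) {        # skip links we've already scraped
    cat(paste("Doing", links[i], "..."))
    ok <- FALSE
    counter <- 0
    while (ok == FALSE & counter <= 5) {  # retry up to five times
      counter <- counter + 1
      out <- tryCatch(
        scrape_test(links[i]),
        error = function(e) {
          Sys.sleep(2)                    # wait a couple of seconds before retrying
          e                               # return the error object
        }
      )
      if ("error" %in% class(out)) {
        cat(".")                          # still failing: print a dot and try again
      } else {
        ok <- TRUE
        cat(" Done.")
      }
    }
    cat("\n")
    d[[i]] <- out                         # store the result (or the last error)
    names(d)[i] <- links[i]               # name the element after its link
  }
}

If every link eventually succeeds, the resulting list of data frames can then be combined, for instance with dplyr::bind_rows(d).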