Web Scraping multiple Links using R

Posted 2019-09-15 17:06

I am working on a web scraping program to collect data from multiple pages of search results. The code below is an example of what I am working with, but I am only able to get the first page. It would be of great help if someone could point out where I am going wrong in my syntax.

library(rvest)

# Search-results pages 1 through 10
jump <- seq(1, 10, by = 1)

site <- paste0("https://stackoverflow.com/search?page=", jump,
               "&tab=Relevance&q=%5bazure%5d%20free%20tier")

# Scrape the excerpt text from each page
dflist <- lapply(site, function(i) {
  webpage <- read_html(i)
  draft_table <- html_nodes(webpage, '.excerpt')
  draft <- html_text(draft_table)
})



# Combine the per-page results
finaldf <- do.call(cbind, dflist)

finaldf_10 <- data.frame(finaldf)

View(finaldf_10)

Below is the link from which I need to scrape the data; it has 127 pages.

https://stackoverflow.com/search?q=%5Bazure%5D+free+tier

With the above code I am able to get data only from the first page, not from the rest of the pages, and there is no syntax error. Could you please help me find where I am going wrong?

1 Answer

小情绪 Triste
2019-09-15 17:55

Some websites put security measures in place to prevent bulk scraping, and I guess SO is one of them. More on that: https://github.com/JonasCz/How-To-Prevent-Scraping/blob/master/README.md

In fact, if you delay your calls a little, this will work. I tried with a 5-second Sys.sleep. I guess you can reduce it, but a shorter delay may not work (I tried with a 1-second Sys.sleep, and that didn't work).

Here is working code:

library(rvest)
library(purrr)

# Loop over search-results pages 1 through 10
dflist <- map(.x = 1:10, .f = function(x) {
  Sys.sleep(5)  # pause between requests so the calls are not blocked
  url <- paste0("https://stackoverflow.com/search?page=", x,
                "&q=%5bazure%5d%20free%20tier")
  read_html(url) %>%
    html_nodes('.excerpt') %>%
    html_text() %>%
    as.data.frame()
}) %>% do.call(rbind, .)
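If you need all 127 pages mentioned in the question, the same pattern scales by widening the page range. Below is a rough sketch, not a tested solution: the helper names scrape_page and safe_scrape are just illustrative, the 5-second delay and the 127-page count are taken from this thread, and purrr::possibly() is used so that one failed request does not abort the whole run.

library(rvest)
library(purrr)

# Scrape the excerpt text from one search-results page
scrape_page <- function(page) {
  Sys.sleep(5)  # keep the per-request delay that worked above
  url <- paste0("https://stackoverflow.com/search?page=", page,
                "&q=%5bazure%5d%20free%20tier")
  read_html(url) %>%
    html_nodes('.excerpt') %>%
    html_text() %>%
    as.data.frame()
}

# Return NULL instead of erroring if a single page fails
safe_scrape <- possibly(scrape_page, otherwise = NULL)

# All 127 pages; drop NULLs from failed pages before binding rows
finaldf <- map(1:127, safe_scrape) %>%
  compact() %>%
  do.call(rbind, .)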

Best,

Colin
