Web Scraping multiple Links using R

Posted 2019-09-15 17:06

I am working on a web scraping program to collect data from multiple pages of search results. The code below is an example of what I am working with, but I am only able to get the first page. It would be of great help if someone could point out where I am going wrong in my syntax.

library(rvest)

# Search-results pages 1 through 10
jump <- seq(1, 10, by = 1)

site <- paste0("https://stackoverflow.com/search?page=", jump,
               "&tab=Relevance&q=%5bazure%5d%20free%20tier")

# Scrape the excerpt text from each page
dflist <- lapply(site, function(i) {
  webpage <- read_html(i)
  draft_table <- html_nodes(webpage, '.excerpt')
  draft <- html_text(draft_table)
})



# Combine the per-page results
finaldf <- do.call(cbind, dflist)

finaldf_10 <- data.frame(finaldf)

View(finaldf_10)

Below is the link from which I need to scrape the data; it has 127 pages.

https://stackoverflow.com/search?q=%5Bazure%5D+free+tier

With the above code I am able to get data only from the first page, not from the rest of the pages, and there is no syntax error. Could you please help me find where I am going wrong?

1 Answer

小情绪 Triste
2019-09-15 17:55

Some websites put security measures in place to prevent bulk scraping, and I guess SO is one of them. More on that: https://github.com/JonasCz/How-To-Prevent-Scraping/blob/master/README.md

In fact, if you delay your calls a little, this will work. I tried with a 5-second Sys.sleep. I guess you can reduce it, but a shorter delay may not work (I tried with a 1-second Sys.sleep, and that didn't work).

Here is working code:

library(rvest)
library(purrr)

# Loop over search-results pages 1 through 10
dflist <- map(.x = 1:10, .f = function(x) {
  Sys.sleep(5)  # pause between requests so the calls are not blocked
  url <- paste0("https://stackoverflow.com/search?page=", x,
                "&q=%5bazure%5d%20free%20tier")
  read_html(url) %>%
    html_nodes('.excerpt') %>%
    html_text() %>%
    as.data.frame()
}) %>% do.call(rbind, .)
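If you need all 127 pages mentioned in the question, the same pattern scales by widening the page range. Below is a rough sketch, not a tested solution: the helper names scrape_page and safe_scrape are just illustrative, the 5-second delay and the 127-page count are taken from this thread, and purrr::possibly() is used so that one failed request does not abort the whole run.

library(rvest)
library(purrr)

# Scrape the excerpt text from one search-results page
scrape_page <- function(page) {
  Sys.sleep(5)  # keep the per-request delay that worked above
  url <- paste0("https://stackoverflow.com/search?page=", page,
                "&q=%5bazure%5d%20free%20tier")
  read_html(url) %>%
    html_nodes('.excerpt') %>%
    html_text() %>%
    as.data.frame()
}

# Return NULL instead of erroring if a single page fails
safe_scrape <- possibly(scrape_page, otherwise = NULL)

# All 127 pages; drop NULLs from failed pages before binding rows
finaldf <- map(1:127, safe_scrape) %>%
  compact() %>%
  do.call(rbind, .)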

Best,

Colin
