I am working on a web scraping program to search for data from multiple sheets. The code below is an example of what I am working with. I am able to get only the first sheet on this. It will be of great help if someone can point out where I am going wrong in my syntax.
jump <- seq(1, 10, by = 1)
site <- paste0("https://stackoverflow.com/search?page=",jump,"&tab=Relevance&q=%5bazure%5d%20free%20tier")
dflist <- lapply(site, function(i) {
webpage <- read_html(i)
draft_table <- html_nodes(webpage,'.excerpt')
draft <- html_text(draft_table)
})
finaldf <- do.call(cbind, dflist)
finaldf_10<-data.frame(finaldf)
View(finaldf_10)
Below is the link from where I need to scrape the data which has 127 pages.
[https://stackoverflow.com/search?q=%5Bazure%5D+free+tier][1]
As per the above code I am able to get data only from the first page and not the rest of the pages. There is no syntax error also. Could you please help me in finding out where I am going wrong.
Some websites put a security to prevent bulk scraping. I guess SO is one of them. More on that : https://github.com/JonasCz/How-To-Prevent-Scraping/blob/master/README.md
In fact, if you delay a little your calls, this will work. I've tried w/ 5 seconds
Sys.sleep
. I guess you can reduce it, but this may not work (I've tried with a 1 secondSys.sleep
, that didn't work).Here is a working code :
Best,
Colin