I'm trying to scrape data from a password-protected website in R using the rvest package. My code currently logs in to the website at each iteration of a loop that will run about 15,000 times. This seems very inefficient, but I have not found a way around it, because jumping to a different URL without first logging in simply returns the website's login page. A simplification of my code is as follows:
library(rvest)

# Placeholder login URL and credentials (the real values are omitted here)
url <- "https://example.com/login"
session <- html_session(url)
form <- html_form(session)[[1]]
filled_form <- set_values(form,
                          username = "me@example.com",
                          password = "my_password")

# Log in and scrape the first table
start_table <- submit_form(session, filled_form) %>%
  jump_to("https://example.com/table/start") %>%  # placeholder URL of the first table
  html_node("table.inlayTable") %>%
  html_table()
data_table <- start_table
for (i in 1:nrow(data_ids)) {
  # Logs in again on every iteration -- the step I would like to avoid
  current_table <- try(submit_form(session, filled_form) %>%
                         jump_to(paste0("https://example.com/table/",  # placeholder URL parts
                                        data_ids[i, ],
                                        "/detail")) %>%
                         html_node("table.inlayTable") %>%
                         html_table())
  data_table <- rbind(data_table, current_table)
}
For simplicity, the handling of any errors thrown inside the try() call is suppressed here. Note that data_ids is a data frame containing the part of the URL that is updated at each iteration.
Does anyone have a suggestion for how this scraping could be achieved without logging in at each iteration of the loop?
Thank you! Yann
You can save the session in a variable, but I'm not sure how much time it will save.
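The idea: submit_form() returns a new session that carries the login cookies, so you can log in once before the loop and reuse that logged-in session with jump_to() on every iteration. A minimal sketch, assuming the same pre-1.0 rvest API as the question and placeholder URLs/credentials:

library(rvest)

# Log in once; the session returned by submit_form() keeps the cookies
session <- html_session("https://example.com/login")    # placeholder URL
form <- html_form(session)[[1]]
filled_form <- set_values(form,
                          username = "me@example.com",  # placeholder
                          password = "my_password")     # placeholder
logged_in <- submit_form(session, filled_form)

# Reuse the logged-in session; no further submit_form() calls
tables <- vector("list", nrow(data_ids))
for (i in 1:nrow(data_ids)) {
  page <- try(jump_to(logged_in,
                      paste0("https://example.com/table/",  # placeholder URL parts
                             data_ids[i, ], "/detail")))
  if (!inherits(page, "try-error")) {
    tables[[i]] <- page %>%
      html_node("table.inlayTable") %>%
      html_table()
  }
}
data_table <- do.call(rbind, tables)

If the server expires the session partway through, you could re-run the login step only when jump_to() lands back on the login page, instead of on every iteration. Here is my script for web scraping: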