I wrote the following code to scrape tendering information from a portal on a daily basis.
packages <- c('rvest', 'stringi', 'tidyverse', 'lubridate', 'dplyr')
purrr::walk(packages, library, character.only = TRUE, warn.conflicts = FALSE)
start_time <- proc.time()
The main page is scraped to get the total number of records.
data <- read_html('https://eprocure.gov.in/mmp/latestactivetenders')
# Parse the tender table on page one into a data frame
total_tenders_raw <- html_nodes(data, xpath = '//*[(@id = "table")]')
All_tenders <- data.frame(html_table(total_tenders_raw, header = TRUE))
# Extract the hrefs and keep only the links pointing to the full tender view
links <- html_nodes(data, xpath = '//*[(@id = "table")] | //td | //a')
links_fair <- html_attr(links, 'href')
links_fair <- links_fair[grep("tendersfullview", links_fair)]
All_tenders <- cbind(All_tenders, links_fair)
Reading the total number of records to fetch
Count_of_Recs_raw <- html_nodes(data, xpath = '//*[(@id = "edit-l-active-teners")]//div')
Count_of_Recs <- as.numeric(gsub("Total Tenders : ","",html_text(Count_of_Recs_raw[1])))
Functions for cleaning and processing data fields such as dates and factors.
process_dates <- function(data){
  cols2date <- c('Bid.Submission.Closing.Date', 'epublished_date', 'document_download_start_date',
                 'bid_submission_start_date', 'bid_opening_date', 'document_download_end_date',
                 'bid_submission_end_date')
  date_processed_data <- data
  date_processed_data[cols2date] <- lapply(data[cols2date], dmy_hm)
  return(date_processed_data)
}
clean_process_data <- function(data){
  cols2factor <- c('State.Name', 'product_category', 'pre_qualification', 'organisation_name',
                   'organisation_type', 'tender_type')
  clean_processed_data <- data
  clean_processed_data[cols2factor] <- lapply(data[cols2factor], factor)
  #clean_processed_data <- process_dates(clean_processed_data)
  return(clean_processed_data)
}
The code below is precisely where my question lies...
Table scraping starts here. Page one has already been scraped to get the structure of the data frame.
for (page_no in 2:round(Count_of_Recs/10)){
  closeAllConnections()
  on.exit(closeAllConnections())
  url_bit1 <- 'https://eprocure.gov.in/mmp/latestactivetenders/page='
  url <- paste(url_bit1, page_no, sep = "")
  cat(page_no, "\t", proc.time() - start_time, "\n")
  data <- read_html(url)
  total_tenders_raw <- html_nodes(data, xpath = '//*[(@id = "table")]')
  Page_tenders <- data.frame(html_table(total_tenders_raw, header = TRUE))
  links <- html_nodes(data, xpath = '//*[(@id = "table")] | //td | //a')
  links_fair <- html_attr(links, 'href')
  links_fair <- links_fair[grep("tendersfullview", links_fair)]
  Page_tenders <- cbind(Page_tenders, links_fair)
  All_tenders <- rbind(All_tenders, Page_tenders)
}
This for loop usually ends up taking hours to complete. I am looking to use the apply family to good effect so as to save time. This program also has the further responsibility of fetching and processing all records, and then scraping an entirely new page for each individual record every time (code not listed here)...
I have tried the following code, but it doesn't give me what I want:
url_bit1 <- 'https://eprocure.gov.in/mmp/latestactivetenders/page='
read_page <- function(datain){
  closeAllConnections()
  on.exit(closeAllConnections())
  url <- paste(url_bit1, datain$S.No., sep = "")
  cat(S.No., "\t", proc.time() - start_time, "\n")
  data <- read_html(url)
  total_tenders_raw <- html_nodes(data, xpath = '//*[(@id = "table")]')
  Page_tenders <- data.frame(html_table(total_tenders_raw, header = TRUE))
  links <- html_nodes(data, xpath = '//*[(@id = "table")] | //td | //a')
  links_fair <- html_attr(links, 'href')
  links_fair <- links_fair[grep("tendersfullview", links_fair)]
  Page_tenders <- cbind(Page_tenders, links_fair)
  All_tenders <- rbind(All_tenders, Page_tenders)
}
All_tenders <- sapply(All_tenders, FUN=read_page(All_tenders$S.No.))
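For reference, something along these lines is roughly what I am aiming for (fetch_page is just a placeholder name I made up; it reuses the same URL pattern, selectors, and ten-records-per-page assumption as above), though I am not sure it is correct or idiomatic:
# A sketch only: fetch one page and return its rows plus the tender links,
# then bind all pages together once at the end instead of growing All_tenders in a loop.
fetch_page <- function(page_no) {
  url <- paste0('https://eprocure.gov.in/mmp/latestactivetenders/page=', page_no)
  page <- read_html(url)
  tbl <- html_nodes(page, xpath = '//*[(@id = "table")]')
  Page_tenders <- data.frame(html_table(tbl, header = TRUE))
  links <- html_attr(html_nodes(page, xpath = '//*[(@id = "table")] | //td | //a'), 'href')
  links_fair <- links[grep("tendersfullview", links)]
  cbind(Page_tenders, links_fair)
}
pages <- 2:round(Count_of_Recs / 10)
All_tenders <- rbind(All_tenders, do.call(rbind, lapply(pages, fetch_page)))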
Any advice, guidance, suggestions, inputs, or help is welcome. I have been using R for only 3-4 months. I am also aware of Python's strengths in this area over R, but I am inclined towards R for the solution to this problem.