Web Scrape: Select Fields from Drop Downs, Extract

2019-05-26 00:16发布

问题:

Try to do some webscraping in R and could use some help.

I would like to extract the data in the table at this page http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx

But I would like to first select County from the left most drop down, then select Alameda County (CA) from the next dropdown, then scrape the data in the table.

This is what I have so far, but I think I know why its not working - rvest form functions are suited to filling out a basic form not selecting from drop downs on a .aspx(?). Searched around for examples of what I am trying to do but came up empty.

library(rvest)
url       <-"http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx"       
pgsession <-html_session(url)               
pgform    <-html_form(pgsession)[[1]]       

filled_form <- set_values(pgform,
                      `#atype_chosen span` = "County", 
                      `#asel_chosen span` = "Alameda Count (CA)") 
submit_form(pgsession,filled_form)

Anyway, this gives me an error "Error: Unknown field names: #atype_chosen span, #asel_chosen span". I sort of get it...I am asking R to enter County into the box without opening the drop down which isn't going to work.

If someone could point me in the right direction, I'd appreciate it.

回答1:

I monitored the requests the browser made when I selected your county and used that info to create this. It's gets you your data, just in a different way from how you went about it... The area parameter in the payload is for different counties.

update: I've added the code to get the county list and codes so you can select whatever county you want to get the data from...

library("httr")

# start by getting the counties and their codes...
url <- "http://droughtmonitor.unl.edu/Ajax.aspx/ReturnAOI"
headers <- add_headers(
  "Accept" = "application/json, text/javascript, */*; q=0.01",
  "Accept-Encoding" = "gzip, deflate",
  "Accept-Language" = "en-US,en;q=0.8",
  "Content-Length" = "16",
  "Content-Type" = "application/json; charset=UTF-8",
  "Host" = "droughtmonitor.unl.edu",
  "Origin" = "http://droughtmonitor.unl.edu",
  "Proxy-Connection" = "keep-alive",
  "Referer" = "http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx",
  "User-Agent" = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36",
  "X-Requested-With" = "XMLHttpRequest"
)
a <- POST(url, body="{'aoi':'county'}", headers, encode="json")
tmp <- content(a)[[1]]
county_df <- data.frame(text=unname(unlist(sapply(tmp, "[", "Text"))),
                  value=unname(unlist(sapply(tmp, "[", "Value"))),
                  stringsAsFactors=FALSE)

# use the code for whatever county you want in the payload below...

url <- "http://droughtmonitor.unl.edu/Ajax.aspx/ReturnTabularDM"
payload <- "{'area':'06001', 'type':'county', 'statstype':'1'}"
headers <- add_headers(
                "Host" = "droughtmonitor.unl.edu",
                "Proxy-Connection" = "keep-alive",
                "Content-Length" = "50",
                "Accept" = "application/json, text/javascript, */*; q=0.01",
                "Origin" = "http://droughtmonitor.unl.edu",
                "X-Requested-With" = "XMLHttpRequest",
                "User-Agent" = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36",
                "Content-Type" = "application/json; charset=UTF-8",
                "Referer" = "http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx",
                "Accept-Encoding" = "gzip, deflate",
                "Accept-Language" = "en-US,en;q=0.8",
                "X-Requested-With" = "XMLHttpRequest"
)
a <- POST(url, body=payload, headers, encode="json")
tmp <- content(a)[[1]]
df <- data.frame(date=unname(unlist(sapply(tmp, "[", "Date"))),
                 d0=unname(unlist(sapply(tmp, "[", "D0"))),
                 d1=unname(unlist(sapply(tmp, "[", "D1"))),
                 d2=unname(unlist(sapply(tmp, "[", "D2"))),
                 d3=unname(unlist(sapply(tmp, "[", "D3"))),
                 d4=unname(unlist(sapply(tmp, "[", "D4"))),
                 stringsAsFactors=FALSE)