Webscrape tables on websites that use AngularJS using R

Posted 2019-09-07 18:12

Using R (with the packages rvest, jsonlite and httr), I am trying to programmatically download all the data files available at the following URL:

http://environment.data.gov.uk/ds/survey/index.jsp#/survey?grid=TQ38

I have tried using Chrome's "Inspect" feature and viewing the page source to find the download options, but the page appears to use ng-table and AngularJS to retrieve the final URL for each dataset. The index.jsp file seems to reference a JavaScript file, downloads/ea.downloads.js, which looks valuable, but I am unsure how to find it or understand which functions I need to call.

Ideally the first result would be a data.frame or data.table with one column for the Product and one column with the URL of each file to be downloaded. This would be useful so that I can subsequently loop through the rows of the table and download each zip file, as sketched below.
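Something like this minimal sketch is what I am aiming for, where result stands for the hoped-for table (result and the column names Product and URL are purely illustrative):

result <- data.frame(Product = character(), URL = character(), stringsAsFactors = FALSE)
# ...once result holds one row per downloadable file:
for (i in seq_len(nrow(result))) {
  download.file(result$URL[i], destfile = basename(result$URL[i]), mode = "wb")
}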

I think this AngularJS issue is similar to this question:

web scrape with rvest

but I cannot work out how my code should be adjusted for this example.

2 Answers
再贱就再见 · 2019-09-07 18:45

I am sure there is a better solution; this is not a final answer, but it is a start. It appears the data you are looking for is stored in a JSON file associated with the main page. Once that file is downloaded, you can process it to determine which files to download.

library(httr)
library(jsonlite)
library(magrittr)  # provides the %>% pipe used below

# base URL for the JSON file (found by examining the requests made on page load);
# TQ38 is the grid square from the question's URL
curl <- "http://www.geostore.com/environment-agency/rest/product/OS_GB_10KM/TQ38?catalogName=Survey"
datafile <- GET(curl)

# process the response and flatten it to a dataframe
output <- content(datafile, as = "text") %>% fromJSON(flatten = FALSE)
# examine this dataframe to identify the desired files

# baseurl was determined by manually downloading one file
baseurl <- "http://www.geostore.com/environment-agency/rest/product/download/"
# example of downloading a file given the base URL and guid;
# row 49 was chosen at random to test the download
download.file(paste0(baseurl, output$guid[49]), output$fileName[49], method = "auto", mode = "wb")
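To get the Product/URL table asked for in the question, one could assemble it from this dataframe and then loop over the rows. A sketch (file_table is just an illustrative name; descriptiveName, fileName and guid are columns present in the JSON):

# build a table of product names and download URLs from the JSON dataframe
file_table <- data.frame(
  Product = output$descriptiveName,
  URL     = paste0(baseurl, output$guid),
  File    = output$fileName,
  stringsAsFactors = FALSE
)

# loop through the rows and download each zip file
for (i in seq_len(nrow(file_table))) {
  download.file(file_table$URL[i], destfile = file_table$File[i], mode = "wb")
}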

The naming scheme from the site is confusing; I will leave it to the experts to determine the meaning.

我只想做你的唯一 · 2019-09-07 18:56

A slight expansion on Dave2e's solution demonstrating how to get the XHR JSON resource with splashr:

library(splashr) # devtools::install_github("hrbrmstr/splashr")
library(tidyverse)

splashr requires a Splash server, and the package provides a way to start one with Docker. Read the help on the GitHub page and inside the package to find out how to use it; a minimal setup sketch follows.
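For completeness, a sketch of the one-time setup (this assumes Docker is installed and running locally; install_splash() pulls the Splash image that start_splash() then launches):

splashr::install_splash()   # one-time: pull the Splash Docker image
# once the server is started below, splashr::splash_active() should return TRUE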

vm <- start_splash() 

URL <- "http://environment.data.gov.uk/ds/survey/index.jsp#/survey?grid=TQ38"

This retrieves all the resources loaded by the page:

splash_local %>% render_har(URL) -> resources # get ALL the items the page loads

stop_splash(vm) # we don't need the splash server anymore

This targets the background XHR resource with catalogName in it. You'd still need to hunt to find this initially, but once you know the pattern, it becomes a generic operation for other grid squares (see the sketch after the captured URL below).

map_chr(resources$log$entries, c("request", "url")) %>% 
  grep("catalogName", ., value=TRUE) -> files_json

files_json
## [1] "http://www.geostore.com/environment-agency/rest/product/OS_GB_10KM/TQ38?catalogName=Survey"

Read that in:

guids <- jsonlite::fromJSON(files_json)

glimpse(guids)
## Observations: 98
## Variables: 12
## $ id              <int> 170653, 170659, 170560, 170565, 178307, 178189, 201556, 238...
## $ guid            <chr> "54595a8c-b267-11e6-93d3-9457a5578ca0", "63176082-b267-11e6...
## $ pyramid         <chr> "LIDAR-DSM-1M-ENGLAND-2003-EA", "LIDAR-DSM-1M-ENGLAND-2003-...
## $ tileReference   <chr> "TQ38", "TQ38", "TQ38", "TQ38", "TQ38", "TQ38", "TQ38", "TQ...
## $ fileName        <chr> "LIDAR-DSM-1M-2003-TQ3580.zip", "LIDAR-DSM-1M-2003-TQ3585.z...
## $ coverageLayer   <chr> "LIDAR-DSM-1M-ENGLAND-2003-EA-MD-YY", "LIDAR-DSM-1M-ENGLAND...
## $ fileSize        <int> 76177943, 52109669, 59326278, 18048623, 13204420, 11919071,...
## $ descriptiveName <chr> "LIDAR Tiles DSM at 1m spatial resolution 2003", "LIDAR Til...
## $ description     <chr> "1m", "1m", "1m", "1m", "1m", "1m", "1m", "1m", "1m", "1m",...
## $ groupName       <chr> "LIDAR-DSM-TIMESTAMPED-ENGLAND-2003-EA", "LIDAR-DSM-TIMESTA...
## $ displayOrder    <int> -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,...
## $ metaDataUrl     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "https://data.g...

The rest is similar to the other answer:

dl_base <- "http://www.geostore.com/environment-agency/rest/product/download"
urls <- sprintf("%s/%s", dl_base, guids$guid)

Be kind to your network and their server:

walk2(urls, guids$fileName, download.file)
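An even gentler variant, as a sketch (the 5-second pause and the skip-if-present check are illustrative choices):

walk2(urls, guids$fileName, function(u, f) {
  if (!file.exists(f)) {          # skip files already downloaded
    download.file(u, f, mode = "wb")
    Sys.sleep(5)                  # pause between requests
  }
})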

Do this only if you think your system and their server can handle 98 simultaneous 70-100 MB file downloads (simultaneous downloads require method = "libcurl"):

download.file(urls, guids$fileName, method = "libcurl")