Scraping location data in rvest

Posted 2019-06-22 10:03

Question:

I'm currently trying to scrape latitude/longitude data from a list of URLs using rvest. Each URL has an embedded Google map showing a specific location, but the URLs themselves don't show the path that the API is taking.

When looking at the page source, I see that the part I'm after is here:

<script type="text/javascript" src="http://maps.google.com/maps/api/js?sensor=false">
</script>
<script type="text/javascript">
function initialize() {
var myLatlng = new google.maps.LatLng(43.805170,-70.722084);
var myOptions = {
  zoom: 16,
  center: myLatlng,
  mapTypeId: google.maps.MapTypeId.SATELLITE
}
var map = new google.maps.Map(document.getElementById("map_canvas"), myOptions);

var marker = new google.maps.Marker({
    position: myLatlng, 
    map: map,
    title:"F.E. Wood & Sons - Natural Energy"
});   

Now, if I can just get the line that has the LatLng(....) input, I can use some string parsing operations to derive the latitude and longitude values for all of the URLs.
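As a rough sketch of that string-parsing step, assuming the line has already been retrieved as a character string, base R's regexec() and regmatches() can pull out the two numbers. The helper name parse_lat_lng is hypothetical and only for illustration.

```r
# Hypothetical helper (not from the original post): pull the two numbers
# out of a "LatLng(lat,lng)" call found in a line of JavaScript.
parse_lat_lng <- function(js_line) {
  # Two optionally negative decimal numbers separated by a comma.
  pattern <- "LatLng\\([[:space:]]*(-?[0-9.]+)[[:space:]]*,[[:space:]]*(-?[0-9.]+)"
  m <- regmatches(js_line, regexec(pattern, js_line))[[1]]
  # regexec() yields an empty result when nothing matches.
  if (length(m) == 0) {
    return(c(lat = NA_real_, lng = NA_real_))
  }
  # m[1] is the full match; m[2] and m[3] are the capture groups.
  c(lat = as.numeric(m[2]), lng = as.numeric(m[3]))
}

js_line <- "var myLatlng = new google.maps.LatLng(43.805170,-70.722084);"
coords <- parse_lat_lng(js_line)
```

Returning NA for both values when no LatLng(...) call is found keeps the result the same shape for every URL, which makes it easier to collect into a table later.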

I've written the following code to get my data:

require(rvest)
require(magrittr)
fetchLatLong<-function(url){
  url<-as.character(url)
  solNum<-html(url)%>%
    html_nodes("#map_canvas")%>%
    html_attr("script")
}

(The "map_canvas" selector was found using SelectorGadget; you can view the entire source here.)

I'm having the worst time getting this to read what I'm after. I've tried many nodes and combinations of nodes, to no avail. I've played around with PhantomJS, but the problem is that it's not JS-rendered HTML content I'm after: I'm looking for the API query input, which is written into the page code (or, at least, to my amateur eye appears to be).

Does anyone have any advice?

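Answer 1:

This seems to work. The coordinates are in the text of an inline <script> element, not in an attribute of #map_canvas, so select that script node, take its text, and pull the numbers out with a regular expression: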

library(rvest)
library(magrittr)
library(stringr)

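# html() is deprecated in newer rvest releases; read_html() replaces it
pg <- read_html("http://biomassmagazine.com/plants/view/2285")

pg %>%
  html_nodes("div.pad20 > script") %>%  # inline scripts inside the content div
  extract2(2) %>%                       # the second one defines initialize()
  html_text() %>%
  str_match_all("LatLng\\(([[:digit:]\\.\\-]+),([[:digit:]\\.\\-]+)") %>%
  extract2(1) %>%                       # match matrix for the single input string
  extract(2:3) -> lat_lng               # capture groups: latitude, longitude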

lat_lng

## [1] "43.805170"  "-70.722084"
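
Since the question asks about a whole list of URLs, the same idea could be wrapped in a function and applied to each one. The following is only a sketch under some assumptions: extract_lat_lng and fetch_lat_lng are hypothetical names, and rather than relying on the "div.pad20 > script" position it searches the text of every script element for the first LatLng(...) call, which may or may not suit the other pages.

```r
library(rvest)
library(stringr)

# Hypothetical helper: search all inline scripts of a parsed page for the
# first LatLng(lat,lng) call and return the two numbers (NA if none found).
extract_lat_lng <- function(page) {
  js <- paste(html_text(html_nodes(page, "script")), collapse = "\n")
  m <- str_match(js, "LatLng\\(\\s*(-?[0-9.]+)\\s*,\\s*(-?[0-9.]+)")
  # str_match() returns a one-row matrix: full match, then capture groups.
  c(lat = as.numeric(m[1, 2]), lng = as.numeric(m[1, 3]))
}

# Hypothetical wrapper: one row per URL; a page that fails to download
# or parse produces NA coordinates instead of stopping the whole run.
fetch_lat_lng <- function(urls) {
  rows <- lapply(urls, function(u) {
    tryCatch(
      extract_lat_lng(read_html(u)),
      error = function(e) c(lat = NA_real_, lng = NA_real_)
    )
  })
  data.frame(url = urls, do.call(rbind, rows), stringsAsFactors = FALSE)
}
```

Calling fetch_lat_lng() with a character vector of page URLs, such as the one used above, should then give a data frame with url, lat and lng columns.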