I'm having some problems scraping data from a website. To start with, I don't have much experience with web scraping...
My intended plan is to scrape some data using R from the following website:
http://spiderbook.com/company/17495/details?rel=300795
In particular, I want to extract the links to the articles on this page.
My idea so far:
library(XML)
xmltext <- htmlParse("http://spiderbook.com/company/17495/details?rel=300795")
sources <- xpathApply(xmltext, "//body//div")
sourcesChar <- sapply(sources, saveXML)  # serialize the nodes to character strings
sourcesCharSep <- lapply(sourcesChar, function(x) unlist(strsplit(x, " ")))
sourcesInd <- lapply(sourcesCharSep, function(x) grep('"(http://[^"]*)"', x))
But this doesn't bring up the intended information. Some help would be really appreciated here! Thanks!
Best
Christoph
You picked a tough problem to learn on.
This site uses JavaScript to load the article information. In other words, the link loads a set of scripts which run when the page loads to grab the information (probably from a database) and insert it into the DOM. htmlParse(...) just grabs the base HTML and parses that, so the links you want are simply not present.
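You can check this for yourself: parsing the raw HTML and looking for the article anchors (the class="doclink" anchors described below) turns up nothing. A minimal sketch, assuming only the XML package:
library(XML)
rawDoc <- htmlParse("http://spiderbook.com/company/17495/details?rel=300795")
length(getNodeSet(rawDoc, '//a[@class="doclink"]'))  # expect 0: the links are injected later by JavaScript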
AFAIK the only way around this is to use the RSelenium package. This package essentially passes the base HTML through what amounts to a browser simulator, which does run the scripts. The problem with RSelenium is that you need to download not only the package but also a "Selenium Server". This link has a nice introduction to RSelenium.
Once you've done that, inspecting the source in a browser shows that the article links are all in the href attribute of anchor tags which have class="doclink". This is straightforward to extract using XPath. NEVER NEVER NEVER use regex to parse XML.
library(XML)
library(RSelenium)
url <- "http://spiderbook.com/company/17495/details?rel=300795"
checkForServer() # download Selenium Server, if not already present
startServer() # start Selenium Server
remDr <- remoteDriver() # instantiates a new driver
remDr$open() # open connection
remDr$navigate(url) # grab and process the page (including scripts)
doc <- htmlParse(remDr$getPageSource()[[1]])
links <- as.character(doc['//a[@class="doclink"]/@href'])
links
# [1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
# [2] "http://insideevs.com/category/vw/"
# [3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"
# [4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"
# [5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"
# [6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"
# [7] "http://www.calcharge.org/2014/07/"
# [8] "http://nl.anygator.com/search/volkswagen+winterbanden"
# [9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"
As @jihoward notes, RSelenium will solve this problem and won't require inspecting network traffic or dissecting the underlying website to find the appropriate quantities. I would also note that RSelenium can run without Selenium Server if phantomjs is installed on the user's system; in that case RSelenium can drive phantomjs directly. There is a vignette relating to headless browsing at http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html
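To give a flavour of driving phantomjs directly (the same pattern used in the examples below), a minimal sketch, assuming phantomjs is installed and on the system path:
library(XML)
library(RSelenium)
pJS <- phantom()                                # start phantomjs, no Selenium Server required
remDr <- remoteDriver(browserName = "phantom")  # connect a driver to it
remDr$open()
remDr$navigate("http://spiderbook.com/company/17495/details?rel=300795")
doc <- htmlParse(remDr$getPageSource()[[1]])    # page source after the scripts have run
as.character(doc['//a[@class="doclink"]/@href'])  # same XPath as in @jihoward's answer
pJS$stop()                                      # shut phantomjs down again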
Inspecting web traffic with a browser
In this case, however, inspecting the traffic shows the following JSON file being called: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0, and it is not cookie-protected or sensitive to the user-agent string etc. In this instance the following can be done:
library(RJSONIO)
res <- fromJSON("http://spiderbook.com/company/details/docs?rel=300795&docs_page=0")
> sapply(res$data, "[[", "url")
[1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
[2] "http://insideevs.com/category/vw/"
[3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"
[4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"
[5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"
[6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"
[7] "http://www.calcharge.org/2014/07/"
[8] "http://nl.anygator.com/search/volkswagen+winterbanden"
[9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"
Inspecting web traffic by writing a simple function for phantomJS
With RSelenium and phantomJS we can also inspect the traffic on the fly (currently only when driving phantomJS directly). As a simple example, we record the requested and received calls from the current webpage we are viewing and store them in "traffic.txt" in the current working directory:
library(RSelenium)
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
appUrl <- "http://spiderbook.com/company/17495/details?rel=300795"
psScript <- "var page = this;
var fs = require(\"fs\");
fs.write(\"traffic.txt\", 'WEBSITE CALLS\\n', 'w');
page.onResourceRequested = function(request) {
fs.write(\"traffic.txt\", 'Request: ' + request.url + '\\n', 'a');
};
page.onResourceReceived = function(response) {
fs.write(\"traffic.txt\", 'Receive: ' + response.url + '\\n', 'a');
};"
result <- remDr$phantomExecute(psScript)
remDr$navigate(appUrl)
urlTraffic <- readLines("traffic.txt")
> head(urlTraffic)
[1] "WEBSITE CALLS"
[2] "Request: http://spiderbook.com/company/17495/details?rel=300795"
[3] "Receive: http://spiderbook.com/company/17495/details?rel=300795"
[4] "Request: http://spiderbook.com/static/js/jquery-1.10.2.min.js"
[5] "Request: http://spiderbook.com/static/js/lib/jquery.dropkick-1.0.2.js"
[6] "Request: http://spiderbook.com/static/js/jquery.textfill.js"
> urlTraffic[grepl("Receive: http://spiderbook.com/company/details", urlTraffic)]
[1] "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"
[2] "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"
pJS$stop() # stop phantomJS
Here we can see that one of the received files was "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0".
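The discovered endpoint can then be pulled out of traffic.txt and queried as before; a small sketch reusing the RJSONIO call from earlier in this answer:
library(RJSONIO)
recLine <- urlTraffic[grepl("Receive: http://spiderbook.com/company/details", urlTraffic)][1]
docUrl <- sub("^Receive: ", "", recLine)  # strip the "Receive: " prefix
res <- fromJSON(docUrl)
sapply(res$data, "[[", "url")             # the same article links as before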
Using phantomJS/ghostdriver built-in HAR support to inspect traffic
In fact phantomJS/ghostdriver creates its own HAR files, so just by browsing to the page while driving phantomJS we already have access to all the request/response data:
library(RSelenium)
library(RJSONIO) # for fromJSON below
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
appUrl <- "http://spiderbook.com/company/17495/details?rel=300795"
remDr$navigate(appUrl)
harLogs <- remDr$log("har")[[1]]
harLogs <- fromJSON(harLogs$message)
# HAR logs contain a lot of detail; here we just illustrate accessing the data
requestURLs <- sapply(lapply(harLogs$log$entries, "[[", "request"), "[[","url")
requestHeaders <- lapply(lapply(harLogs$log$entries, "[[", "request"), "[[", "headers")
XHRIndex <- which(grepl("XMLHttpRequest", sapply(requestHeaders, sapply, "[[", "value")))
> harLogs$log$entries[XHRIndex][[1]]$request$url
[1] "http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"
So the last example shows interrogating the HAR file produced by phantomJS to find the XMLHttpRequest requests, and then returning the specific URL which, as we would hope, corresponds to the JSON file we found at the beginning of the answer.
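To close the loop, the URL recovered from the HAR log can be fed straight back into fromJSON, and phantomJS stopped when we are done; a brief sketch following the same pattern as above:
xhrUrl <- harLogs$log$entries[XHRIndex][[1]]$request$url
res <- fromJSON(xhrUrl)        # fetch the JSON endpoint discovered in the HAR log
sapply(res$data, "[[", "url")  # the article links
pJS$stop()                     # stop phantomJS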
Any browser's network inspector will tell you where it's getting the data from. In this case it seems to be JSON via http://spiderbook.com/company/details/docs?rel=300795 - which means it's a two-liner with jsonlite:
> require(jsonlite)
> x=fromJSON("http://spiderbook.com/company/details/docs?rel=300795")
> x$data$url
[1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
[2] "http://insideevs.com/category/vw/"
[3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"
[4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"
[5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"
[6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"
[7] "http://www.calcharge.org/2014/07/"
[8] "http://nl.anygator.com/search/volkswagen+winterbanden"
[9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"
I suspect this part of the JSON tells you whether the returned data has more pages:
> x$has_next
[1] FALSE
and then I suspect there's a parameter to the URL that gets data from a certain page.
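If so, pagination could be handled with a loop along these lines. This is only a sketch: the docs_page parameter is an assumption based on the URL seen in the other answers (...docs?rel=300795&docs_page=0), and has_next is the field shown above:
library(jsonlite)
base <- "http://spiderbook.com/company/details/docs?rel=300795&docs_page="
page <- 0
urls <- character(0)
repeat {
  x <- fromJSON(paste0(base, page))  # fetch one page of results
  urls <- c(urls, x$data$url)        # collect the article links
  if (!isTRUE(x$has_next)) break     # stop when no further pages are reported (assumed semantics)
  page <- page + 1
}
urls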
How do you get the JSON URL from the public URL? I'm not totally sure, since I don't know what the "17495" is doing there...