Web-Scraping with R

Posted 2019-04-11 20:02

Question:

I'm having some problems scraping data from a website. First of all, I don't have a lot of experience with web scraping... My plan is to scrape some data using R from the following website: http://spiderbook.com/company/17495/details?rel=300795

In particular, I want to extract the links to the articles on this site.

My idea so far:

library(XML)

xmltext <- htmlParse("http://spiderbook.com/company/17495/details?rel=300795")
sources <- xpathApply(xmltext, "//body//div")
sourcesChar <- sapply(sources, saveXML)  # serialize each <div> node to a character string
sourcesCharSep <- lapply(sourcesChar, function(x) unlist(strsplit(x, " ")))
sourcesInd <- lapply(sourcesCharSep, function(x) grep('"(http://[^"]*)"', x))

But this doesn't return the intended information. Any help would be really appreciated! Thanks!

Best, Christoph

Answer 1:

You picked a tough problem to learn on.

This site uses JavaScript to load the article information. In other words, the URL loads a set of scripts which run when the page loads, grab the information (probably from a database), and insert it into the DOM. htmlParse(...) just grabs the base HTML and parses that, so the links you want are simply not present.
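
You can see this for yourself by parsing just the static HTML and searching for the article anchors (the class="doclink" selector identified below); a minimal check with the XML package, where the expected count is 0 because the anchors are injected client-side:

library(XML)

url    <- "http://spiderbook.com/company/17495/details?rel=300795"
static <- htmlParse(url)
# The anchors holding the article links are injected later by JavaScript,
# so the static source should contain none of them.
length(getNodeSet(static, '//a[@class="doclink"]'))
# expected: 0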

AFAIK the only way around this is to use the RSelenium package. This package essentially allows you to pass the base HTML through what looks like a browser simulator, which does run the scripts. The problem with RSelenium is that you need not only the package, but also a "Selenium Server". This link has a nice introduction to RSelenium.

Once you've done that, inspection of the source in a browser shows that the article links are all in the href attribute of anchor tags which have class=doclink. This is straightforward to extract using XPath. NEVER NEVER NEVER use regex to parse XML.

library(XML)
library(RSelenium)
url <- "http://spiderbook.com/company/17495/details?rel=300795"
checkForServer()        # download Selenium Server, if not already present
startServer()           # start Selenium Server
remDr <- remoteDriver() # instantiates a new driver
remDr$open()            # open connection
remDr$navigate(url)     # grab and process the page (including scripts)
doc   <- htmlParse(remDr$getPageSource()[[1]])
links <- as.character(doc['//a[@class="doclink"]/@href'])
links
# [1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
# [2] "http://insideevs.com/category/vw/"                                                                                    
# [3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"                                                             
# [4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"     
# [5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"                            
# [6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"                                           
# [7] "http://www.calcharge.org/2014/07/"                                                                                    
# [8] "http://nl.anygator.com/search/volkswagen+winterbanden"                                                                
# [9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"
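
As an aside: in more recent versions of RSelenium, checkForServer() and startServer() have been removed. A roughly equivalent sketch using the newer rsDriver() helper (which starts a local Selenium server and browser for you; Firefox is used here purely as an example) might look like this:

library(XML)
library(RSelenium)

url    <- "http://spiderbook.com/company/17495/details?rel=300795"
driver <- rsDriver(browser = "firefox")  # starts a local server and an already-open client
remDr  <- driver$client
remDr$navigate(url)

doc   <- htmlParse(remDr$getPageSource()[[1]])
links <- as.character(doc['//a[@class="doclink"]/@href'])

remDr$close()
driver$server$stop()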


Answer 2:

As @jihoward notes, RSelenium will solve this problem and won't require inspecting network traffic or dissecting the underlying website to find the appropriate quantities. I would also note that RSelenium can run without a Selenium Server if phantomjs is installed on the user's system; in that case RSelenium can drive phantomjs directly. There is a vignette relating to headless browsing at http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html
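
For example, a minimal headless sketch that combines driving phantomjs directly with the XPath approach from @jihoward's answer (assuming phantomjs is installed and on the PATH) might look like:

library(XML)
library(RSelenium)

pJS   <- phantom()                             # drive phantomjs directly, no Selenium Server
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
remDr$navigate("http://spiderbook.com/company/17495/details?rel=300795")

doc   <- htmlParse(remDr$getPageSource()[[1]])
links <- as.character(doc['//a[@class="doclink"]/@href'])

remDr$close()
pJS$stop()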

Inspecting web traffic with a browser

In this case, however, inspecting the traffic shows that the following JSON file is being requested: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0 - and it is not cookie-protected or sensitive to the user-agent string, etc. In this instance the following can be done:

library(RJSONIO)
res <- fromJSON("http://spiderbook.com/company/details/docs?rel=300795&docs_page=0")
> sapply(res$data, "[[", "url")
[1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
[2] "http://insideevs.com/category/vw/"                                                                                    
[3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"                                                             
[4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"     
[5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"                            
[6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"                                           
[7] "http://www.calcharge.org/2014/07/"                                                                                    
[8] "http://nl.anygator.com/search/volkswagen+winterbanden"                                                                
[9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"      

Inspecting web traffic by writing a simple script for phantomJS

With RSelenium and phantomJS we can also inspect the traffic on the fly (currently only when driving phantomJS directly). As a simple example, we record the requests made and responses received by the current webpage and store them in "traffic.txt" in our current working directory:

library(RSelenium)
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
appUrl <- "http://spiderbook.com/company/17495/details?rel=300795"
psScript <- "var page = this;
             var fs = require(\"fs\");
             fs.write(\"traffic.txt\", 'WEBSITE CALLS\\n', 'w');
             page.onResourceRequested = function(request) {
                fs.write(\"traffic.txt\", 'Request: ' + request.url + '\\n', 'a');
             };
             page.onResourceReceived = function(response) {
                fs.write(\"traffic.txt\", 'Receive: ' + response.url + '\\n', 'a');
             };"

result <- remDr$phantomExecute(psScript)

remDr$navigate(appUrl)
urlTraffic <- readLines("traffic.txt")
> head(urlTraffic)
[1] "WEBSITE CALLS"                                                        
[2] "Request: http://spiderbook.com/company/17495/details?rel=300795"      
[3] "Receive: http://spiderbook.com/company/17495/details?rel=300795"      
[4] "Request: http://spiderbook.com/static/js/jquery-1.10.2.min.js"        
[5] "Request: http://spiderbook.com/static/js/lib/jquery.dropkick-1.0.2.js"
[6] "Request: http://spiderbook.com/static/js/jquery.textfill.js"          

> urlTraffic[grepl("Receive: http://spiderbook.com/company/details", urlTraffic)]
[1] "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"
[2] "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"

pJS$stop() # stop phantomJS

Here we can see one of the received files was "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0".

Using phantomJS/ghostdriver's built-in HAR support to inspect traffic

In fact phantomJS/ghostdriver creates its own HAR files, so just by browsing to the page while driving phantomJS we already have access to all the request/response data:

library(RSelenium)
library(RJSONIO)  # for fromJSON() below
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
appUrl <- "http://spiderbook.com/company/17495/details?rel=300795"
remDr$navigate(appUrl)
harLogs <- remDr$log("har")[[1]]
harLogs <- fromJSON(harLogs$message)
# HAR files contain a lot of detail; here we just illustrate accessing the data
requestURLs <- sapply(lapply(harLogs$log$entries, "[[", "request"), "[[","url")
requestHeaders <- lapply(lapply(harLogs$log$entries, "[[", "request"), "[[", "headers")
XHRIndex <- which(grepl("XMLHttpRequest", sapply(requestHeaders, sapply, "[[", "value")))

> harLogs$log$entries[XHRIndex][[1]]$request$url
[1] "http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"

So the last example shows interrogating the HAR file produced by phantomJS to find the XMLHttpRequest requests and then return the specific URL which, as we would hope, corresponds to the JSON file we found at the beginning of the answer.



Answer 3:

Any browser's network inspector will tell you where it's getting the data from. In this case it seems to be JSON via http://spiderbook.com/company/details/docs?rel=300795 - which means it's a two-liner with jsonlite:

> require(jsonlite)
> x=fromJSON("http://spiderbook.com/company/details/docs?rel=300795")
> x$data$url
[1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
[2] "http://insideevs.com/category/vw/"                                                                                    
[3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"                                                             
[4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"     
[5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"                            
[6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"                                           
[7] "http://www.calcharge.org/2014/07/"                                                                                    
[8] "http://nl.anygator.com/search/volkswagen+winterbanden"                                                                
[9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"      

I suspect this part of the JSON tells you if the returned data has more pages:

> x$has_next
[1] FALSE

and then I suspect there's a parameter to the URL that gets data from a certain page.
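
If that guess is right, paging is straightforward; here is a sketch using jsonlite and the docs_page parameter seen in the other answers (untested, since the pagination behaviour is only inferred from has_next):

library(jsonlite)

base <- "http://spiderbook.com/company/details/docs?rel=300795&docs_page="
urls <- character(0)
page <- 0
repeat {
  x    <- fromJSON(paste0(base, page))
  urls <- c(urls, x$data$url)
  if (!isTRUE(x$has_next)) break   # has_next is assumed to be FALSE on the last page
  page <- page + 1
}
urls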

How do you get the JSON URL from the public URL? I'm not totally sure since I don't know what the "17495" is doing there...
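
One guess, based purely on the two URLs shown above, is that only the rel parameter matters, so the JSON endpoint could be built directly from it:

# Purely a guess: assume the docs endpoint only needs the rel parameter
# that also appears in the public page URL.
pageUrl <- "http://spiderbook.com/company/17495/details?rel=300795"
rel     <- sub(".*rel=([0-9]+).*", "\\1", pageUrl)
docsUrl <- paste0("http://spiderbook.com/company/details/docs?rel=", rel, "&docs_page=0")
docsUrl
# "http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"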