I find that web scraping tasks in R can often be handled with the easy-to-use rvest package by fetching the HTML code that generates a webpage. This "usual" approach (as I may call it), however, seems to fall short when the website uses JavaScript to display the relevant data. As a working example, I would like to scrape news headlines from this website. The two main obstacles for the usual approach are the "load more" button at the bottom and the extraction of the headlines using XPath. In particular:
library(rvest)
library(magrittr)
url = "http://www.nestle.com/media/news-archive#agregator-search-results"
webs = read_html(url)
# Headline of the first news based on its xpath
webs %>% html_nodes(xpath="//*[@id='agregator-search-results']/span[2]/ul/li[1]/a/span[2]/span[1]") %>% html_text
#[1] ""
# Same for the description of the first news
webs %>% html_nodes(xpath="//*[@id='agregator-search-results']/span[2]/ul/li[1]/a/span[2]/span[2]") %>% html_text
#[1] ""
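My guess (an assumption, I have not confirmed it against the real page) is that the static HTML only contains empty placeholder elements that JavaScript fills in later. A minimal offline sketch of that situation, with made-up markup standing in for the real page:

```r
library(rvest)

# Hypothetical static HTML (ASSUMPTION): the container exists in the source,
# but the headline span is empty until JavaScript populates it in a browser.
static_html <- '<html><body><div id="agregator-search-results">
  <span></span>
</div></body></html>'

page <- read_html(static_html)

# The XPath matches a node, but that node carries no text yet
headline <- html_text(
  html_nodes(page, xpath = "//*[@id='agregator-search-results']/span[1]")
)
headline
```

If this is what happens, the XPath itself is fine and the empty string comes from the missing JavaScript step rather than from a wrong selector.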
Maybe someone can shed light on (one of) the following questions:
- Am I missing something obvious here? That is, is it possible to scrape the headlines using the usual approach based on rvest in this case? As far as I currently understand, it is not.
- Are RSelenium and PhantomJS the only way to go here? To put it differently, can the task be achieved without RSelenium and PhantomJS in particular? This could include either extracting the headlines or loading more headlines (or both).
Any input is appreciated.
IMO, it's sometimes better to look for the raw data that the page requests in the background:
library(jsonlite)
library(RCurl)
n <- 8 # number of news items to pull
useragent <- "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0"
url <- sprintf("http://www.nestle.com/_handlers/advancedsearch.ashx?q=Nestle%%2Bdaterange%%3A..2016-01-05&index=0&num=%d&client=Nestle_Corp&site=Nestle_Corp_Media&requiredfields=MediaType:/media/pressreleases/allpressreleases|MediaType:/Media/NewsAndFeatures|MediaType:/Media/News&sort=date:D:R:d1&filter=p&access=p&entsp=a&oe=UTF-8&ie=UTF-8&ud=1&ProxyReload=1&exclude_apps=1&entqr=3&getfields=*", n)
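json <- getURL(url, useragent=useragent) # fetch the raw JSON response
res <- fromJSON(json)
df <- res$GSP$RES$R # one row per news item
# U holds the link, T the title, FS$'@VALUE' the date
head(cbind(df[, c("U", "T")], df$FS$'@VALUE'))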
# U T df$FS$"@VALUE"
# 1 http://www.nestle.com/media/newsandfeatures/nestle-150-years 'Good Food, Good Life': Celebrating 150 years of <b>Nestlé</b> <b>...</b> 2016-01-01
# 2 http://www.nestle.com/media/newsandfeatures/2015-in-pictures 2015 in pictures | <b>Nestlé</b> Global 2015-12-23
# 3 http://www.nestle.com/media/news/nescafe-dolce-gusto-expands-in-brazil Coffee superstar: Nescafé Dolce Gusto expands in Brazil <b>...</b> 2015-12-17
# 4 http://www.nestle.com/media/news/nestle-waters-new-bottling-plant-italy <b>Nestlé</b> Waters needs youth, for its new bottling plant in Italy <b>...</b> 2015-12-10
# 5 http://www.nestle.com/media/news/nestle-launch-wellness-club-personalised-health-service-japan Matcha made in nutritional heaven: <b>Nestlé</b> launches Wellness <b>...</b> 2015-12-08
# 6 http://www.nestle.com/media/news/nestle-completes-chf-8-billion-share-buyback-programme <b>Nestlé</b> completes CHF 8 billion share buyback programme <b>...</b> 2015-12-07
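As for loading more headlines: the request above carries both an index and a num parameter. Assuming (not verified) that index is a zero-based offset into the result list and num is the page size, something like the following sketch could build the URLs for successive pages. The helper name build_page_url is made up for illustration.

```r
# Hypothetical helper: same URL as above, with index and num filled in.
# ASSUMPTION: index is a zero-based offset and num the page size; the
# original request only ever used index=0, so this is untested guesswork.
build_page_url <- function(index, num) {
  sprintf(
    paste0(
      "http://www.nestle.com/_handlers/advancedsearch.ashx",
      "?q=Nestle%%2Bdaterange%%3A..2016-01-05&index=%d&num=%d",
      "&client=Nestle_Corp&site=Nestle_Corp_Media",
      "&requiredfields=MediaType:/media/pressreleases/allpressreleases",
      "|MediaType:/Media/NewsAndFeatures|MediaType:/Media/News",
      "&sort=date:D:R:d1&filter=p&access=p&entsp=a&oe=UTF-8&ie=UTF-8",
      "&ud=1&ProxyReload=1&exclude_apps=1&entqr=3&getfields=*"
    ),
    index, num
  )
}

page_size <- 8
offsets <- seq(0, by = page_size, length.out = 3) # 0, 8, 16
urls <- build_page_url(offsets, page_size)        # sprintf is vectorised

# Each URL could then be fetched as above, e.g.
# pages <- lapply(urls, function(u) fromJSON(getURL(u, useragent = useragent))$GSP$RES$R)
```

Whether the server actually honours index this way would have to be checked by comparing the first rows of two consecutive pages.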
df contains more information, some of which probably has to be unnested if you want to use it.
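A rough sketch of the unnesting, run on a small hand-written mock instead of the live response. The mock JSON is an assumption: it only mimics the U, T and FS$'@VALUE' fields used above, and its values are invented. It also strips the <b> highlighting tags that show up in the titles.

```r
library(jsonlite)

# Mock response (ASSUMPTION): a stand-in shaped like res$GSP$RES$R above,
# with invented URLs and titles; the real response has more fields.
mock_json <- '{"GSP": {"RES": {"R": [
  {"U": "http://example.com/a", "T": "<b>Nestle</b> news <b>...</b>",
   "FS": {"@VALUE": "2016-01-01"}},
  {"U": "http://example.com/b", "T": "Plain title",
   "FS": {"@VALUE": "2015-12-23"}}
]}}}'

df <- fromJSON(mock_json)$GSP$RES$R

# FS is a data frame nested inside df; flatten() lifts its columns to the
# top level, prefixing them with the parent name
flat <- jsonlite::flatten(df)
names(flat)

# Remove the HTML highlighting tags from the titles
flat$title <- trimws(gsub("<[^>]+>", "", flat$T))
flat$title
```

jsonlite::flatten is written with its namespace here because other packages (purrr, for one) export a function of the same name.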