I find that web scraping tasks in R can often be handled with the easy-to-use rvest package by fetching the HTML code that generates a webpage. This "usual" approach (as I may call it), however, seems to miss some functionality when the website uses JavaScript to display the relevant data. As a working example, I would like to scrape news headlines from this website. The two main obstacles for the usual approach are the "load more" button at the bottom and the extraction of the headlines via XPath. In particular:
library(rvest)
library(magrittr)
url = "http://www.nestle.com/media/news-archive#agregator-search-results"
webs = read_html(url)
# Headline of the first news based on its xpath
webs %>% html_nodes(xpath="//*[@id='agregator-search-results']/span[2]/ul/li[1]/a/span[2]/span[1]") %>% html_text
#[1] ""
# Same for the description of the first news
webs %>% html_nodes(xpath="//*[@id='agregator-search-results']/span[2]/ul/li[1]/a/span[2]/span[2]") %>% html_text
#[1] ""
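One quick way to confirm that the content is injected by JavaScript is to search the downloaded source for a headline that is visible in the browser. A small sketch (the search string below is a placeholder, not an actual headline from the site):
# Search the raw HTML fetched by read_html() for a headline seen in the browser.
# If there is no match, the headlines are not part of the static page source and
# are presumably rendered by JavaScript after the page loads.
grepl("headline text seen in the browser", as.character(webs))  # placeholder string; expected FALSE here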
Maybe someone can shed light on (one of) the following questions:
- Am I missing something obvious here? That is, is it possible to scrape the headlines using the usual approach based on rvest in this case? As far as I currently understand, that is not the case.
- Are RSelenium and phantomJS the only way to go here? To put it differently, can the task be achieved without using RSelenium and phantomJS in particular? This could include either the extraction of the headlines or loading more headlines (or both).
Any input is appreciated.
Imo, it's sometimes better to look for the raw data in the background:
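A minimal sketch of that approach, with one caveat: the request URL below is a placeholder, not the real endpoint. Open the page with the browser's developer tools (Network tab, XHR/Fetch filter), click "load more", and copy the request that returns the headlines as JSON; such endpoints often take a page or offset parameter, which also makes the "load more" button unnecessary.
library(jsonlite)
# Placeholder URL: substitute the JSON request found in the browser's network tab.
json_url = "http://www.nestle.com/REPLACE-with-the-request-that-returns-the-headlines"
df = fromJSON(json_url)
str(df, max.level = 2)  # inspect the structure; the headlines sit somewhere in here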
df contains more information, some of which probably has to be unnested if you want to use it.
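If fromJSON returns a data frame with list-columns, something along these lines flattens them (the column name is a guess, not taken from the actual response):
library(tidyr)
# "results" is a hypothetical list-column name; replace it with whatever str(df) shows.
df_flat = unnest(df, cols = c(results))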