I find that web scraping tasks in R can often be handled with the easy-to-use rvest package by fetching the HTML code that generates a webpage. This "usual" approach (as I may call it), however, seems to miss some functionality when the website uses JavaScript to display the relevant data. As a working example, I would like to scrape news headlines from this website. The two main obstacles for the usual approach are the "load more" button at the bottom and the extraction of the headlines via XPath. In particular:
library(rvest)
library(magrittr)
url = "http://www.nestle.com/media/news-archive#agregator-search-results"
webs = read_html(url)
# Headline of the first news based on its xpath
webs %>% html_nodes(xpath="//*[@id='agregator-search-results']/span[2]/ul/li[1]/a/span[2]/span[1]") %>% html_text
#[1] ""
# Same for the description of the first news
webs %>% html_nodes(xpath="//*[@id='agregator-search-results']/span[2]/ul/li[1]/a/span[2]/span[2]") %>% html_text
#[1] ""
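One quick way to confirm that the content is injected by JavaScript is to search the downloaded source for a headline that is visible in the browser. A small sketch (the search string below is a placeholder, not an actual headline from the site):
# Search the raw HTML fetched by read_html() for a headline seen in the browser.
# If there is no match, the headlines are not part of the static page source and
# are presumably rendered by JavaScript after the page loads.
grepl("headline text seen in the browser", as.character(webs))  # placeholder string; expected FALSE here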
Maybe someone can shed light on (one of) the following questions:
- Am I missing something obvious here? That is, is it possible to scrape the headlines using the usual approach based on rvest in this case? As far as I currently understand, that is not the case.
- Are RSelenium and phantomJS the only way to go here? To put it differently, can the task be achieved without using RSelenium and phantomJS in particular? This could include either the extraction of the headlines or loading more headlines (or both).
Any input is appreciated.
Imo, it's sometimes better to look for the raw data in the background:
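A minimal sketch of that approach, with one caveat: the request URL below is a placeholder, not the real endpoint. Open the page with the browser's developer tools (Network tab, XHR/Fetch filter), click "load more", and copy the request that returns the headlines as JSON; such endpoints often take a page or offset parameter, which also makes the "load more" button unnecessary.
library(jsonlite)
# Placeholder URL: substitute the JSON request found in the browser's network tab.
json_url = "http://www.nestle.com/REPLACE-with-the-request-that-returns-the-headlines"
df = fromJSON(json_url)
str(df, max.level = 2)  # inspect the structure; the headlines sit somewhere in here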
df contains more information, some of which probably has to be unnested if you want to use it.
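If fromJSON returns a data frame with list-columns, something along these lines flattens them (the column name is a guess, not taken from the actual response):
library(tidyr)
# "results" is a hypothetical list-column name; replace it with whatever str(df) shows.
df_flat = unnest(df, cols = c(results))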