Missing elements when using `read_html` using `rve

I'm trying to use the read_html function in the rvest package, but have come across a problem I am struggling with.

For example, if I were trying to read in the bottom table that appears on this page, I would use the following code:

library(rvest)
html_content <- read_html("https://projects.fivethirtyeight.com/2016-election-forecast/washington/#now")

By inspecting the HTML code in the browser, I can see that the content I would like is contained in a <table> tag (specifically, it is all contained within <table class="t-calc">). But when I try to extract this using:

tables <- html_nodes(html_content, xpath = '//table')

I retrieve the following:

> tables
{xml_nodeset (4)}
[1] <table class="tippingpointroi unexpanded">\n  <tbody>\n    <tr data-state="FL" class=" "> ...
[2] <table class="tippingpointroi unexpanded">\n  <tbody>\n    <tr data-state="NV" class=" "> ...
[3] <table class="scenarios">\n  <tbody/>\n  <tr data-id="1">\n    <td class="description">El ...
[4] <table class="t-desktop t-polls">\n  <thead>\n    <tr class="th-row">\n      <th class="t ...

This includes some of the table elements on the page, but not the one I am interested in.

Any suggestions on where I am going wrong would be most appreciated!

Tags: html r web-scraping rvest

1 Answer

太酷不给撩

#2 · 2019-06-01 02:41

The table is built dynamically from data in JavaScript variables on the page itself. read_html() does not execute JavaScript, so that table never appears in the HTML that rvest parses. Either use RSelenium to grab the page source after it has rendered and pass that to rvest, or grab a treasure trove of all the data by using V8:

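library(rvest)
library(V8)

URL <- "http://projects.fivethirtyeight.com/2016-election-forecast/washington/#now"

# Fetch the static HTML (no JavaScript is executed at this point)
pg <- read_html(URL)

# Pull out the source of the <script> block whose text contains 'race.model'
js <- html_nodes(pg, xpath=".//script[contains(., 'race.model')]") %>%  html_text()

# Evaluate that script in a V8 JavaScript context
ctx <- v8()
ctx$eval(JS(js))

# Copy the 'race' object from JavaScript into a nested R list
race <- ctx$get("race", simplifyVector=FALSE)

str(race) ## output too large to paste here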
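
To see why this works without hitting the live site, here is a minimal sketch of the same technique on a made-up page. The inline HTML, the contents of `race`, and the field names `state`, `forecasts`, `party` and `prob` are all assumptions for illustration; the real page's data is structured differently and is much larger.

```r
library(rvest)
library(V8)

# Hypothetical stand-in for the real page: the table is meant to be filled in
# by a script, so the static HTML only contains an empty <table>.
html <- '
<html><body>
  <table class="t-calc"></table>
  <script>
    var race = {};
    race.model = {state: "WA", forecasts: [{party: "D", prob: 0.9}, {party: "R", prob: 0.1}]};
  </script>
</body></html>'

pg <- read_html(html)

# The table node exists, but it has no rows because no script has run
n_rows <- length(html_nodes(pg, xpath = "//table[@class='t-calc']//tr"))

# Same extraction as in the answer: take the script text and evaluate it in V8
js <- html_nodes(pg, xpath = ".//script[contains(., 'race.model')]") %>% html_text()

ctx <- v8()
ctx$eval(JS(js))

# The JavaScript object comes back as a nested R list
race <- ctx$get("race", simplifyVector = FALSE)
str(race)
```

Here `n_rows` shows what rvest alone can see, while `race` holds the data the script would have used to build the table.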

If they ever change the formatting of the JavaScript (it's an automated process, so that's unlikely, but you never know), the RSelenium approach will be the better one, provided they don't also change the table structure (again, unlikely, but you never know).
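
For the RSelenium route, something along these lines may work. It is only a rough sketch: `fetch_rendered_tables` is a hypothetical helper name, it assumes a local Firefox that `rsDriver()` can launch, and the fixed `Sys.sleep()` is a crude guess at how long the page needs to render. The `t-calc` class in the usage line comes from the question.

```r
library(rvest)
library(RSelenium)

# Hypothetical helper: render the page in a real browser, then hand the
# resulting HTML to rvest and return the tables matched by `xpath`.
fetch_rendered_tables <- function(url, xpath, wait = 5) {
  # Assumption: Firefox is installed and rsDriver() can start it
  rD <- rsDriver(browser = "firefox", verbose = FALSE)
  remDr <- rD$client
  # Close the browser and stop the Selenium server even if something fails
  on.exit({
    remDr$close()
    rD$server$stop()
  }, add = TRUE)

  remDr$navigate(url)
  Sys.sleep(wait)  # crude wait for the scripts to finish building the table

  # getPageSource() returns a list; its first element is the rendered HTML
  rendered <- read_html(remDr$getPageSource()[[1]])

  # html_table() on a node set returns a list with one table per node
  html_table(html_nodes(rendered, xpath = xpath))
}

# Usage (not run here, since it starts a browser):
# tabs <- fetch_rendered_tables(
#   "https://projects.fivethirtyeight.com/2016-election-forecast/washington/#now",
#   "//table[contains(@class, 't-calc')]"
# )
```

Because this reads the rendered table rather than the JavaScript source, it depends on the table markup staying the same instead of on the script's formatting.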
