Scraping a table from a section in Wikipedia

Asked 2019-04-13 17:09

I'm trying to come up with a robust way to scrape the final standings of the NFL teams in each season; wonderfully, there is a Wikipedia page with links to all this info.

Unfortunately, there is a lot of inconsistency (perhaps to be expected, given the evolution of league structure) in how/where the final standings table is stored.

The saving grace should be that the relevant table is always in a section with the word "Standings".

Is there some way I can grep a section name and only extract the table node(s) there?

Here are some sample pages to demonstrate the structure:

  • 1922 season - Only one division, one table; found under the heading "Standings", with xpath //*[@id="mw-content-text"]/table[2] and CSS selector #mw-content-text > table.wikitable.

  • 1950 season - Two divisions, two tables; both found under the heading "Final standings". The first has xpath //*[@id="mw-content-text"]/div[2]/table and CSS selector #mw-content-text > div:nth-child(20) > table; the second has xpath //*[@id="mw-content-text"]/div[3]/table and CSS selector #mw-content-text > div:nth-child(21) > table.

  • 2000 season - Two conferences, 6 divisions, two tables; both found under the heading "Final regular season standings". The first has xpath //*[@id="mw-content-text"]/div[2]/table and CSS selector #mw-content-text > div:nth-child(16) > table; the second has xpath //*[@id="mw-content-text"]/div[3]/table and CSS selector #mw-content-text > div:nth-child(17) > table.

In summary:

# season |                                   xpath |                                          css
-------------------------------------------------------------------------------------------------
#   1922 |     //*[@id="mw-content-text"]/table[2] |           #mw-content-text > table.wikitable
#   1950 | //*[@id="mw-content-text"]/div[2]/table | #mw-content-text > div:nth-child(20) > table
#        | //*[@id="mw-content-text"]/div[3]/table | #mw-content-text > div:nth-child(21) > table
#   2000 | //*[@id="mw-content-text"]/div[2]/table | #mw-content-text > div:nth-child(16) > table
#        | //*[@id="mw-content-text"]/div[3]/table | #mw-content-text > div:nth-child(17) > table

Scraping, e.g., 1922 would be easy:

library(rvest)

output <- read_html("https://en.wikipedia.org/wiki/1922_NFL_season") %>%
  html_node(xpath = '//*[@id="mw-content-text"]/table[2]') %>%
  whatever_else(...)
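
For concreteness, a minimal complete version of that one-off (with html_table(fill = TRUE) as my stand-in for whatever_else(), assuming a parsed data frame is the goal):

standings_1922 <- read_html("https://en.wikipedia.org/wiki/1922_NFL_season") %>%
  html_node(xpath = '//*[@id="mw-content-text"]/table[2]') %>%
  html_table(fill = TRUE)  # parse the matched <table> into a data frame

head(standings_1922)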

But I don't see any pattern in either the xpaths or the CSS selectors that I could use to generalize this, so that I don't end up writing 80 individual scrapers.

Is there any robust way to scrape all these tables, especially given the crucial insight that they all sit below a heading whose title would return TRUE from grepl("standing", tolower(section_title))?
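
For what it's worth, the headings themselves are easy to locate, since Wikipedia wraps each section title in a span of class mw-headline; a quick sketch (the standings_sections name is just illustrative):

page <- read_html("https://en.wikipedia.org/wiki/1950_NFL_season")

# Each section title sits in <span class="mw-headline" id="...">
section_titles <- page %>% html_nodes("span.mw-headline") %>% html_text()
standings_sections <- section_titles[grepl("standing", tolower(section_titles))]
standings_sections
# e.g. "Final standings"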

1 Answer
等我变得足够好
Answered 2019-04-13 17:43

You can scrape everything at once by looping the URLs with lapply and pulling the tables with a carefully chosen XPath selector:

library(rvest)

lapply(paste0('https://en.wikipedia.org/wiki/', 1920:2015, '_NFL_season'), 
       function(url){ 
           url %>% read_html() %>% 
               html_nodes(xpath = '//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table') %>% 
               html_table(fill = TRUE)
       })

The XPath selector looks for

  • //span[contains(@id, "tandings")]
    • all spans whose id contains tandings (e.g. the "Standings" or "Final standings" headings; matching on "tandings" avoids worrying about capitalisation)
  • /following::*[@title="Winning percentage" or text()="PCT"]
    • followed later in the HTML by a node with
      • either a title attribute of "Winning percentage"
      • or text equal to "PCT"
  • /ancestor::table
    • and selects the table node(s) up the tree from that node.
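
As a usage note, you may want to wrap that pipeline in a function and name the results by season, so individual years are easy to pull out (the get_standings name is just illustrative):

get_standings <- function(url) {
    url %>% read_html() %>% 
        html_nodes(xpath = '//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table') %>% 
        html_table(fill = TRUE)
}

seasons <- 1920:2015
standings <- setNames(
    lapply(paste0('https://en.wikipedia.org/wiki/', seasons, '_NFL_season'), get_standings),
    seasons
)

standings[["1950"]]  # the two 1950 division tables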
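Since one selector has to cover every season's page, it is also worth checking that nothing came back empty; a quick check against the named standings list above:

# Count matched tables per season and flag any season with no match
n_tables <- vapply(standings, length, integer(1))
names(n_tables)[n_tables == 0]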