Scraping a table from a section in Wikipedia

Asked 2019-04-13 17:09

I'm trying to come up with a robust way to scrape the final standings of the NFL teams in each season; wonderfully, there is a Wikipedia page with links to all this info.

Unfortunately, there is a lot of inconsistency (perhaps to be expected, given the evolution of league structure) in how/where the final standings table is stored.

The saving grace should be that the relevant table is always in a section with the word "Standings".

Is there some way I can grep a section name and only extract the table node(s) there?

Here are some sample pages to demonstrate the structure:

  • 1922 season - Only one division, one table; found under the heading "Standings", with xpath //*[@id="mw-content-text"]/table[2] and CSS selector #mw-content-text > table.wikitable.

  • 1950 season - Two divisions, two tables; both found under the heading "Final standings". The first has xpath //*[@id="mw-content-text"]/div[2]/table and CSS selector #mw-content-text > div:nth-child(20) > table; the second has xpath //*[@id="mw-content-text"]/div[3]/table and CSS selector #mw-content-text > div:nth-child(21) > table.

  • 2000 season - Two conferences, 6 divisions, two tables; both found under the heading "Final regular season standings". The first has xpath //*[@id="mw-content-text"]/div[2]/table and CSS selector #mw-content-text > div:nth-child(16) > table; the second has xpath //*[@id="mw-content-text"]/div[3]/table and CSS selector #mw-content-text > div:nth-child(17) > table.

In summary:

# season |                                   xpath |                                          css
-------------------------------------------------------------------------------------------------
#   1922 |     //*[@id="mw-content-text"]/table[2] |           #mw-content-text > table.wikitable
#   1950 | //*[@id="mw-content-text"]/div[2]/table | #mw-content-text > div:nth-child(20) > table
#        | //*[@id="mw-content-text"]/div[3]/table | #mw-content-text > div:nth-child(21) > table
#   2000 | //*[@id="mw-content-text"]/div[2]/table | #mw-content-text > div:nth-child(16) > table
#        | //*[@id="mw-content-text"]/div[3]/table | #mw-content-text > div:nth-child(17) > table

Scraping, e.g., 1922 would be easy:

library(rvest)

output <- read_html("https://en.wikipedia.org/wiki/1922_NFL_season") %>%
  html_node(xpath = '//*[@id="mw-content-text"]/table[2]') %>%
  whatever_else(...)
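
For concreteness, a minimal complete version of that one-off (with html_table(fill = TRUE) as my stand-in for whatever_else(), assuming a parsed data frame is the goal):

standings_1922 <- read_html("https://en.wikipedia.org/wiki/1922_NFL_season") %>%
  html_node(xpath = '//*[@id="mw-content-text"]/table[2]') %>%
  html_table(fill = TRUE)  # parse the matched <table> into a data frame

head(standings_1922)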

But I don't see any pattern in either the xpaths or the CSS selectors that I could use to generalize this, so that I don't end up writing 80 individual scrapers.

Is there any robust way to scrape all these tables, especially given the crucial insight that they all sit below a heading whose title would return TRUE from grepl("standing", tolower(section_title))?
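
For what it's worth, the headings themselves are easy to locate, since Wikipedia wraps each section title in a span of class mw-headline; a quick sketch (the standings_sections name is just illustrative):

page <- read_html("https://en.wikipedia.org/wiki/1950_NFL_season")

# Each section title sits in <span class="mw-headline" id="...">
section_titles <- page %>% html_nodes("span.mw-headline") %>% html_text()
standings_sections <- section_titles[grepl("standing", tolower(section_titles))]
standings_sections
# e.g. "Final standings"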

1 Answer
等我变得足够好
Answered 2019-04-13 17:43

You can scrape everything at once by looping the URLs with lapply and pulling the tables with a carefully chosen XPath selector:

library(rvest)

lapply(paste0('https://en.wikipedia.org/wiki/', 1920:2015, '_NFL_season'), 
       function(url){ 
           url %>% read_html() %>% 
               html_nodes(xpath = '//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table') %>% 
               html_table(fill = TRUE)
       })

The XPath selector looks for

  • //span[contains(@id, "tandings")]
    • all spans whose id contains tandings (e.g. the "Standings" or "Final standings" headings; matching on "tandings" avoids worrying about capitalisation)
  • /following::*[@title="Winning percentage" or text()="PCT"]
    • followed later in the HTML by a node with
      • either a title attribute of "Winning percentage"
      • or text equal to "PCT"
  • /ancestor::table
    • and selects the table node(s) up the tree from that node.
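
As a usage note, you may want to wrap that pipeline in a function and name the results by season, so individual years are easy to pull out (the get_standings name is just illustrative):

get_standings <- function(url) {
    url %>% read_html() %>% 
        html_nodes(xpath = '//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table') %>% 
        html_table(fill = TRUE)
}

seasons <- 1920:2015
standings <- setNames(
    lapply(paste0('https://en.wikipedia.org/wiki/', seasons, '_NFL_season'), get_standings),
    seasons
)

standings[["1950"]]  # the two 1950 division tables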
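Since one selector has to cover every season's page, it is also worth checking that nothing came back empty; a quick check against the named standings list above:

# Count matched tables per season and flag any season with no match
n_tables <- vapply(standings, length, integer(1))
names(n_tables)[n_tables == 0]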