I'm trying to come up with a robust way to scrape the final standings of the NFL teams in each season; wonderfully, there is a Wikipedia page with links to all this info.
Unfortunately, there is a lot of inconsistency (perhaps to be expected, given the evolution of league structure) in how/where the final standings table is stored.
The saving grace should be that the relevant table is always in a section with the word "Standings".
Is there some way I can grep a section name and only extract the table node(s) there?
Here are some sample pages to demonstrate the structure:
1922 season - Only one division, one table; the table is found under the heading "Standings" and has xpath //*[@id="mw-content-text"]/table[2] and CSS selector #mw-content-text > table.wikitable
1950 season - Two divisions, two tables; both found under the heading "Final standings". The first has xpath //*[@id="mw-content-text"]/div[2]/table and CSS selector #mw-content-text > div:nth-child(20) > table; the second has xpath //*[@id="mw-content-text"]/div[3]/table and CSS selector #mw-content-text > div:nth-child(21) > table
2000 season - Two conferences, six divisions, two tables; both found under the heading "Final regular season standings". The first has xpath //*[@id="mw-content-text"]/div[2]/table and CSS selector #mw-content-text > div:nth-child(16) > table; the second has xpath //*[@id="mw-content-text"]/div[3]/table and CSS selector #mw-content-text > div:nth-child(17) > table
In summary:
# season | xpath | css
-------------------------------------------------------------------------------------------------
# 1922 | //*[@id="mw-content-text"]/table[2] | #mw-content-text > table.wikitable
# 1950 | //*[@id="mw-content-text"]/div[2]/table | #mw-content-text > div:nth-child(20) > table
# | //*[@id="mw-content-text"]/div[3]/table | #mw-content-text > div:nth-child(21) > table
# 2000 | //*[@id="mw-content-text"]/div[2]/table | #mw-content-text > div:nth-child(16) > table
# | //*[@id="mw-content-text"]/div[3]/table | #mw-content-text > div:nth-child(17) > table
Scraping, e.g., 1922 would be easy:

library(rvest)

output <- read_html("https://en.wikipedia.org/wiki/1922_NFL_season") %>%
  html_node(xpath = '//*[@id="mw-content-text"]/table[2]') %>%
  whatever_else(...)
But I don't see any pattern in either the xpath or the CSS selector that I could use to generalize this, so I'm not stuck doing 80 individual scraping exercises.
Is there any robust way to scrape all these tables, especially given the crucial insight that every relevant table is located below a heading whose title would return TRUE from grepl("standing", tolower(section_title))?
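That insight can be checked directly; a quick sketch, using the three heading strings taken from the sample pages above:

```r
# section headings observed on the 1922, 1950, and 2000 sample pages
titles <- c("Standings", "Final standings", "Final regular season standings")

# every one of them passes the case-insensitive "standing" test
grepl("standing", tolower(titles))
```

So a case-insensitive match on "standing" is enough to identify the right section across all three league-structure eras sampled here.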
You can scrape everything at once by looping over the URLs with lapply and pulling the tables with a carefully chosen XPath selector:

//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table

The XPath selector looks for

- //span[contains(@id, "tandings")] : span nodes with an id containing "tandings" (e.g. "Standings", "Final standings"),
- /following::*[@title="Winning percentage" or text()="PCT"] : any node after that with a title attribute of "Winning percentage" or whose text is "PCT",
- /ancestor::table : the table node that is up the tree from that node.
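As a sanity check, here is a minimal sketch of that selector run against an inline HTML fragment that mimics the relevant Wikipedia structure (the fragment itself is invented for illustration; only the XPath comes from the answer above):

```r
library(rvest)  # also re-exports the magrittr pipe

# tiny stand-in for a season page: a "standings" headline span followed by a
# standings table whose PCT header carries title="Winning percentage"
# (structure assumed to match the real pages)
page <- read_html('
<div id="mw-content-text">
  <h2><span class="mw-headline" id="Final_standings">Final standings</span></h2>
  <table class="wikitable">
    <tr><th>Team</th><th title="Winning percentage">PCT</th></tr>
    <tr><td>Canton Bulldogs</td><td>1.000</td></tr>
  </table>
</div>')

tbl <- page %>%
  html_node(xpath = '//span[contains(@id, "tandings")]/following::*[@title="Winning percentage" or text()="PCT"]/ancestor::table') %>%
  html_table()

tbl$Team
```

For the real pages, the same html_node/html_table call goes inside the lapply over the season URLs; html_nodes (plural) picks up both division tables in seasons like 1950 and 2000.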