I want to scrape data from hockey-reference.com, specifically from this link:
https://www.hockey-reference.com/leagues/NHL_1991.html
I want the 4th table, called "Team Statistics," and I also want to drop its first and last rows (but that can wait for another time).
Initially, I want to get the scrape working with the 1991 link, but I want to eventually scrape every link from 1991 to 2017.
library(tidyverse)
library(rvest)
stat_urls <- "https://www.hockey-reference.com/leagues/NHL_1991.html"
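For the eventual 1991 to 2017 run, the season URLs could be generated from the year range. This is only a minimal sketch: it assumes every season page follows the same NHL_<year>.html pattern as the 1991 link, and `seasons` and `all_stat_urls` are names made up for illustration.

```r
# Assumed: each season page uses the same URL pattern as the 1991 link
seasons <- 1991:2017

# Build one URL per season (hypothetical variable name)
all_stat_urls <- paste0(
  "https://www.hockey-reference.com/leagues/NHL_", seasons, ".html"
)

# Quick sanity check on the endpoints of the range
head(all_stat_urls, 1)
tail(all_stat_urls, 1)
```

Some years may not have the table at all (the 2004-05 season was cancelled), so a loop over these URLs would need to tolerate empty results.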
Right now I have only the 1991 link, for simplicity. I cannot find the correct CSS selector, even though I have tried several after a fairly thorough look through the page with the browser's "Inspect" tool. I have tried the following CSS selectors:
table#stats.sortable.stats_table.now.sortable
#stats
#all_stats
#all_stats > div.table_outer_container
#stats > tbody
#div_stats (and all sorts of combos with this one)
None of these work when used in the following code:
team_stats <- stat_urls %>%
  read_html() %>%
  html_nodes("#stats") %>%
  html_table(header = TRUE)
All attempts with "xpath=" also failed. Any help with this would be absolutely phenomenal, and Go Preds!
You can try using RSelenium. I saw a similar answer here: Web Scraping Basketball Reference using R.
Selenium is a pain to install, though. I would try to help with that too, but I installed it a while ago and don't really remember the steps. Good luck.
From the guy who posted the original question:
The above answer worked, but I had to go through Homebrew:
And then I had to use the following code from here:
Using Selenium on Mac Chrome
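A possible alternative that avoids Selenium, offered as a sketch and not tested against the live site: if the table is shipped inside an HTML comment and only inserted by JavaScript, it would show up under "Inspect" but not in what read_html() returns, which would match the symptoms in the question. Whether that is actually the case for this page is an assumption, not something confirmed in this thread. The example below runs on a small made-up HTML string (`page_src`, with placeholder team names and numbers, not real data) to show the idea of pulling comment nodes out and re-parsing them.

```r
library(rvest)

# Made-up stand-in for a downloaded page: the table sits inside an HTML comment.
# Team names and numbers are placeholders, not real statistics.
page_src <- '<html><body><div id="all_stats"><!--
<table id="stats">
<thead><tr><th>Team</th><th>GP</th></tr></thead>
<tbody>
<tr><td>Team A</td><td>80</td></tr>
<tr><td>Team B</td><td>80</td></tr>
</tbody>
</table>
--></div></body></html>'

page <- read_html(page_src)

# The direct selector finds nothing, because commented-out markup
# is not part of the parsed document tree.
direct <- page %>% html_nodes("#stats")
length(direct)

# Collect the text of every comment node, re-parse it as HTML,
# and then select the table from the re-parsed document.
team_stats <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_node("#stats") %>%
  html_table(header = TRUE)

team_stats
```

If this assumption holds for the real page, replacing `page_src` with the URL in `stat_urls` should be the only change needed; if it does not, the RSelenium route above remains the option that was reported to work.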