I'm doing a little project in R that involves scraping some football data from a website. Here's the link to one of the years of data:
http://www.sports-reference.com/cfb/years/2007-schedule.html.
As you can see, there is a "Date" column with the dates hyperlinked. Each hyperlink takes you to the stats from that particular game, which is the data I would like to scrape. Unfortunately, a lot of games take place on the same date, which means their hyperlink text is the same. So if I scrape the hyperlinks from the table (which I have done) and then do something like:
url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
# links: character vector with the scraped date links
for (i in 1:length(links)) {
  stats = html_session(url) %>%
    follow_link(links[i]) %>%
    html_nodes('whateverthisnodeis') %>%
    html_table()
}
it will scrape from the first link corresponding to each date. For example, there were 11 games that took place on Aug 30, 2007, but if I put that date in the follow_link function, it grabs data from the first game (Boise St. vs. Weber St.) every time. Is there any way I can specify that I want it to move down the table?
I have already found a workaround by working out the formula for the URLs that the date hyperlinks point to, but it's a pretty convoluted process, so I thought I'd see if anyone knew how to do it this way.
This is a complete example:
library(rvest)
library(dplyr)
library(pbapply)
# Get the main page
URL <- 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
pg <- read_html(URL)  # read_html() replaces the deprecated html()
# Get the dates links
links <- html_attr(html_nodes(pg, xpath="//table/tbody/tr/td[3]/a"), "href")
# I'm only limiting to 10 since I really don't care about football
# enough to waste the bandwidth.
#
# You can just remove the [1:10] for your needs
# pblapply gives you a much-needed progress bar for free
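scoring_games <- pblapply(links[1:10], function(x) {
  # the hrefs are site-relative, so prepend the host
  game_pg <- read_html(sprintf("http://www.sports-reference.com%s", x))
  # note: this pulls the table with id 'passing'
  scoring <- html_table(html_nodes(game_pg, xpath="//table[@id='passing']"), header=TRUE)[[1]]
  # the first data row holds the real column names
  colnames(scoring) <- scoring[1,]
  # drop that row, plus blank and repeated header rows
  filter(scoring[-1,], !Player %in% c("", "Player"))
})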
# you can bind_rows them all together but you should
# probably add a column for the game then
bind_rows(scoring_games)
## Source: local data frame [27 x 11]
##
## Player School Cmp Att Pct Yds Y/A AY/A TD Int Rate
## (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1 Taylor Tharp Boise State 14 19 73.7 184 9.7 10.7 1 0 172.4
## 2 Nick Lomax Boise State 1 5 20.0 5 1.0 1.0 0 0 28.4
## 3 Ricky Cookman Boise State 1 2 50.0 9 4.5 -18.0 0 1 -12.2
## 4 Ben Mauk Cincinnati 18 27 66.7 244 9.0 8.9 2 1 159.6
## 5 Tony Pike Cincinnati 6 9 66.7 57 6.3 8.6 1 0 156.5
## 6 Julian Edelman Kent State 17 26 65.4 161 6.2 3.5 1 2 114.7
## 7 Bret Meyer Iowa State 14 23 60.9 148 6.4 3.4 1 2 111.9
## 8 Matt Flynn Louisiana State 12 19 63.2 128 6.7 8.8 2 0 154.5
## 9 Ryan Perrilloux Louisiana State 2 3 66.7 21 7.0 13.7 1 0 235.5
## 10 Michael Henig Mississippi State 11 28 39.3 120 4.3 -5.4 0 6 32.4
## .. ... ... ... ... ... ... ... ... ... ... ...
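The comment above about adding a column for the game could be sketched with the `.id` argument of `bind_rows()`. The two data frames and the labels `game_1` and `game_2` below are made up for illustration; in the real code the names might instead be the entries of `links`.

```r
library(dplyr)

# Toy stand-ins for two scraped passing tables (values are illustrative only)
scoring_games <- list(
  data.frame(Player = c("A", "B"), Yds = c("184", "5"), stringsAsFactors = FALSE),
  data.frame(Player = "C", Yds = "244", stringsAsFactors = FALSE)
)

# Hypothetical labels; with real data, e.g. names(scoring_games) <- links[1:10]
names(scoring_games) <- c("game_1", "game_2")

# .id adds a column holding the list name each row came from
all_games <- bind_rows(scoring_games, .id = "game")
```

Each row of `all_games` then carries the label of the game it was scraped from, so the combined table can still be split or grouped by game.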
You are looping, but assigning to the same variable every time, so each iteration overwrites the previous result. Try this:
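url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
# links: character vector with the scraped date links
stats = vector("list", length(links))  # one slot per link
for (i in seq_along(links)) {
  # html_table() returns a list of tables, so store it with [[i]]
  stats[[i]] = html_session(url) %>%
    follow_link(links[i]) %>%
    html_nodes('whateverthisnodeis') %>%
    html_table()
}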
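Note that storing each result separately does not by itself change which page is fetched. As far as I know, `follow_link()` given a string follows the first link containing that text, so identical date text keeps resolving to the same game. Here is a small offline sketch of the difference between link text and hrefs; the HTML snippet and its hrefs are invented for illustration, and each list slot just holds the URL that would be fetched.

```r
library(rvest)

# Invented miniature of the schedule table: same link text, different hrefs
pg <- read_html('<table><tbody>
  <tr><td>1</td><td>1</td><td><a href="/cfb/boxscores/game-a.html">Aug 30, 2007</a></td></tr>
  <tr><td>1</td><td>2</td><td><a href="/cfb/boxscores/game-b.html">Aug 30, 2007</a></td></tr>
</tbody></table>')

anchors   <- html_nodes(pg, xpath = "//table/tbody/tr/td[3]/a")
link_text <- html_text(anchors)        # identical for both rows
hrefs     <- html_attr(anchors, "href") # distinct per game

# Preallocate a list and fill one slot per link, so nothing is overwritten
stats <- vector("list", length(hrefs))
for (i in seq_along(hrefs)) {
  stats[[i]] <- sprintf("http://www.sports-reference.com%s", hrefs[i])
}
```

Looping over the hrefs rather than the link text gives one distinct target per game, which is the approach the complete example above takes.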