rvest: follow different links with same tag

Posted 2019-08-10 17:45

I'm doing a little project in R that involves scraping some football data from a website. Here's the link to one of the years of data:

http://www.sports-reference.com/cfb/years/2007-schedule.html.

As you can see, there is a "Date" column with hyperlinked dates. Each hyperlink takes you to the stats from that particular game, which is the data I would like to scrape. Unfortunately, many games take place on the same date, which means their hyperlinks have the same text. So if I scrape the hyperlinks from the table (which I have done) and then do something like:

url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
# links: character vector with the scraped date links
for (i in 1:length(links)) {
  stats = html_session(url) %>%
    follow_link(links[i]) %>%
    html_nodes('whateverthisnodeis') %>%
    html_table()
}

it will scrape from the first link corresponding to each date. For example, there were 11 games on Aug 30, 2007, but if I pass that date to the follow_link function, it grabs data from the first game (Boise St. vs. Weber St.) every time. Is there any way to specify that I want it to move down the table?

I have already found a workaround by working out the formula for the URLs that the date hyperlinks point to, but it's a pretty convoluted process, so I thought I'd see if anyone knew how to do it this way.

2 Answers
再贱就再见
#2 · 2019-08-10 18:06

This is a complete example:

library(rvest)
library(dplyr)
library(pbapply)

# Get the main page

URL <- 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
pg <- read_html(URL)  # read_html() replaces the deprecated html()

# Get the dates links
links <- html_attr(html_nodes(pg, xpath="//table/tbody/tr/td[3]/a"), "href")

# I'm limiting this to 10 links since I really don't care about football
# enough to waste the bandwidth.
#
# You can just remove the [1:10] for your needs.
# pblapply gives you a much-needed progress bar for free.

scoring_games <- pblapply(links[1:10], function(x) {

  game_pg <- read_html(sprintf("http://www.sports-reference.com%s", x))  # hrefs are site-relative
  scoring <- html_table(html_nodes(game_pg, xpath="//table[@id='passing']"), header=TRUE)[[1]]
  colnames(scoring) <- scoring[1,]
  filter(scoring[-1,], !Player %in% c("", "Player"))

})

# you can bind_rows them all together but you should 
# probably add a column for the game then

bind_rows(scoring_games)

## Source: local data frame [27 x 11]
## 
##             Player            School   Cmp   Att   Pct   Yds   Y/A  AY/A    TD   Int  Rate
##              (chr)             (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1     Taylor Tharp       Boise State    14    19  73.7   184   9.7  10.7     1     0 172.4
## 2       Nick Lomax       Boise State     1     5  20.0     5   1.0   1.0     0     0  28.4
## 3    Ricky Cookman       Boise State     1     2  50.0     9   4.5 -18.0     0     1 -12.2
## 4         Ben Mauk        Cincinnati    18    27  66.7   244   9.0   8.9     2     1 159.6
## 5        Tony Pike        Cincinnati     6     9  66.7    57   6.3   8.6     1     0 156.5
## 6   Julian Edelman        Kent State    17    26  65.4   161   6.2   3.5     1     2 114.7
## 7       Bret Meyer        Iowa State    14    23  60.9   148   6.4   3.4     1     2 111.9
## 8       Matt Flynn   Louisiana State    12    19  63.2   128   6.7   8.8     2     0 154.5
## 9  Ryan Perrilloux   Louisiana State     2     3  66.7    21   7.0  13.7     1     0 235.5
## 10   Michael Henig Mississippi State    11    28  39.3   120   4.3  -5.4     0     6  32.4
## ..             ...               ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
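
The comment above about adding a column for the game could be sketched roughly as follows. This is only an illustration: the two small data frames and the link paths are hypothetical stand-ins for the real scoring_games and links, so that the snippet runs without hitting the site. It relies on bind_rows() taking its .id labels from the names of the list.

```r
library(dplyr)

# Hypothetical stand-ins for two elements of scoring_games
scoring_games <- list(
  data.frame(Player = c("Taylor Tharp", "Nick Lomax"),
             Yds = c("184", "5"), stringsAsFactors = FALSE),
  data.frame(Player = "Ben Mauk", Yds = "244", stringsAsFactors = FALSE)
)

# Hypothetical relative links, one per game (not real paths on the site)
links <- c("/games/game-a.html", "/games/game-b.html")

# Name each list element after the link it was scraped from ...
names(scoring_games) <- links

# ... so bind_rows() can record the source in a new "game" column
combined <- bind_rows(scoring_games, .id = "game")
print(combined)
```

With the real data you would name the list with links[1:10] (or all of links) before binding.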
爱情/是我丢掉的垃圾
#3 · 2019-08-10 18:25

You are looping, but assigning to the same variable every time, so each iteration overwrites the previous result. Try storing each result in a list instead:

url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
# links: character vector with the scraped date links
stats = list()  # one element per link
for (i in seq_along(links)) {
  stats[[i]] = html_session(url) %>%
    follow_link(links[i]) %>%
    html_nodes('whateverthisnodeis') %>%
    html_table()
}
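
Note that storing each result separately does not by itself change which link gets followed: per the follow_link() documentation, a string argument selects the first link containing that text. A small offline sketch, using toy HTML with hypothetical paths rather than the real schedule page, shows that links sharing the same date text can still carry distinct href attributes, which is what the other answer relies on:

```r
library(rvest)

# Toy table: two games on the same date (hypothetical hrefs, not real paths)
page <- read_html('
<table><tbody>
<tr><td>1</td><td>1</td><td><a href="/games/a.html">Aug 30, 2007</a></td></tr>
<tr><td>2</td><td>1</td><td><a href="/games/b.html">Aug 30, 2007</a></td></tr>
</tbody></table>')

# Same XPath as in the other answer: the link in the third cell of each row
anchors <- html_nodes(page, xpath = "//table/tbody/tr/td[3]/a")

link_text <- html_text(anchors)          # identical for both rows
link_href <- html_attr(anchors, "href")  # differs per game

print(link_text)
print(link_href)
```

Looping over the href values instead of the link text avoids the ambiguity.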
