Cumulative mean with conditionals

2019-05-04 18:32发布

问题:

New to R. Small rep of my df:

PTS_TeamHome <- c(101,87,94,110,95)
PTS_TeamAway <- c(95,89,105,111,121)
TeamHome <- c("LAL", "HOU", "SAS", "MIA", "LAL")
TeamAway <- c("IND", "LAL", "LAL", "HOU", "NOP")
df <- data.frame(cbind(TeamHome, TeamAway,PTS_TeamHome,PTS_TeamAway))
df

TeamHome TeamAway PTS_TeamHome PTS_TeamAway
  LAL      IND          101           95
  HOU      LAL           87           89
  SAS      LAL           94          105
  MIA      HOU          110          111
  LAL      NOP           95          121

Imagine these are the first four games of a season with 1230 games. I want to calculate the cumulative points per game (mean) at any given time for the home team and the visiting team.

The output would look like this:

  TeamHome TeamAway PTS_TeamHome PTS_TeamAway HOMETEAM_AVGCUMPTS ROADTEAM_AVGCUMPTS
1  LAL      IND          101           95                101                 95
2  HOU      LAL           87           89                 87                 95
3  SAS      LAL           94          105                 94              98.33
4  MIA      HOU          110          111                110                 99
5  LAL      NOP           95          121               97.5                121

Note that what the formula does for the fifth game for the home team. Since the LAL is the home team it looks for how many points has LAL scored when playing at home or on the road. In this case (101 + 89 + 105 + 95) / 4 = 97.5

Here is what I tried without much success:

lst <- list()
for(i in 1:nrow(df)) lst[[i]] <- ( cumsum(df[which(df$TEAM1[1:i]==df$TEAM1[i]),df$PTS_TeamAway,0]) 
                                 + cumsum(df[which(df$TEAM2[1:i]==df$TEAM1[i]),df$PTS_TeamHome,0]) ) 
                             / #divided by number of games
  df$HOMETEAM_AVGCUMPTS <- unlist(lst)

I wanted to calculate the cumulative PTS and then the number of games to divide it by but none of this worked.

回答1:

lst <- list()
for(i in 1:nrow(df)) lst[[i]] <- mean(c(df$PTS_TeamHome[1:i][df$TeamHome[1:i] == df$TeamHome[i]],
                                        df$PTS_TeamAway[1:i][df$TeamAway[1:i] == df$TeamHome[i]]))
df$HOMETEAM_AVGCUMPTS <- unlist(lst)


lst2 <- list()
for(i in 1:nrow(df)) lst2[[i]] <- mean(c(df$PTS_TeamAway[1:i][df$TeamAway[1:i] == df$TeamAway[i]],
                                        df$PTS_TeamHome[1:i][df$TeamHome[1:i] == df$TeamAway[i]]))
df$ROADTEAM_AVGCUMPTS <- unlist(lst2)


df
#   TeamHome TeamAway PTS_TeamHome PTS_TeamAway HOMETEAM_AVGCUMPTS ROADTEAM_AVGCUMPTS
# 1      LAL      IND          101           95                101                 95
# 2      HOU      LAL           87           89                 87                 95
# 3      SAS      LAL           94          105                 94           98.33333
# 4      MIA      HOU          110          111                110                 99
# 5      LAL      NOP           95          121               97.5                121

The approach is divided into two loops. We are taking the mean of two vectors. They are combined with a mean(c(vec1,vec2)) format.

The first vector is the set of points scored while the home team was at home (team in col1, pts in col3), the second vector is the set of points scored by the home team while they were away (team in col2, pts in col4). We use the for loop as it allows us to easily control how many rows are being considered in the subset. With df$PTS_TeamHome[1:i], the set is limited to the games that were played in the past and the current game. We subset that vector with [df$TeamHome[1:i] == df$TeamHome[i]]. In plain language that expression is "Teams in the "TeamHome category up to the current game that are equal to the Home team currently playing". With those parameters we will not allow "future" games to corrupt the analysis.


For the data, I set the stringsAsFactors argument to FALSE. And converted the points columns to class numeric. See below.

Data

PTS_TeamHome <- c(101,87,94,110,95)
PTS_TeamAway <- c(95,89,105,111,121)
TeamHome <- c("LAL", "HOU", "SAS", "MIA", "LAL")
TeamAway <- c("IND", "LAL", "LAL", "HOU", "NOP")
df <- data.frame(cbind(TeamHome, TeamAway,PTS_TeamHome,PTS_TeamAway), stringsAsFactors=F)
df[3:4] <- lapply(df[3:4], function(x) as.numeric(x))


回答2:

I would argue that you should restructure your data in a tidier format with two rows per game: one row for the visiting team and one row for the home team. It is much easier to work with data that is in a tidy/long format.

library(dplyr)
library(tidyr)

df %>%
  mutate(game = row_number()) %>%
  gather(location, team, TeamHome, TeamAway) %>%
  gather(location2, points, PTS_TeamHome, PTS_TeamAway) %>%
  filter(
    (location == "TeamHome" & location2 == "PTS_TeamHome") | 
      (location == "TeamAway" & location2 == "PTS_TeamAway")
  ) %>%
  select(-location2) %>%
  arrange(game) %>%
  group_by(team) %>%
  mutate(run_mean_points = cummean(points))

data

# note that cbind() is removed.

df <- data.frame(TeamHome, TeamAway,PTS_TeamHome,PTS_TeamAway, stringsAsFactors = FALSE)

Source: local data frame [10 x 5]
Groups: team

   game location team points run_mean_points
1     1 TeamHome  LAL    101       101.00000
2     1 TeamAway  IND     95        95.00000
3     2 TeamHome  HOU     87        87.00000
4     2 TeamAway  LAL     89        95.00000
5     3 TeamHome  SAS     94        94.00000
6     3 TeamAway  LAL    105        98.33333
7     4 TeamHome  MIA    110       110.00000
8     4 TeamAway  HOU    111        99.00000
9     5 TeamHome  LAL     95        97.50000
10    5 TeamAway  NOP    121       121.00000


回答3:

Here's a short loop version which will only over each unique team name once (instead of every single row twice). The idea here is to preallocate a matrix with the desired size and then run a short for loop over the unique team names while filling the correct entries within the matrix. We are creating both the matrix and a temporary data set in a transposed form so the values will be filled row wise instead of column wise (Rs default) because the game sequence is row wise

## Transpose the data once
tempdf <- t(df)     
## Create transposed matrix with future column names
mat <- matrix(NA, 2, nrow(df))
rownames(mat) <- c("HOMETEAM_AVGCUMPTS", "ROADTEAM_AVGCUMPTS")    
## Create a vector of unique team names
indx <- as.character(unique(unlist(df[1:2])))
## Run the loop only over the unique team names
for (i in indx) {
  indx2 <- tempdf[1:2, ] == i               
  temp <- tempdf[3:4, ][indx2]
  mat[indx2] <- cumsum(temp)/seq_along(temp)
}
## Combine result with the original data
cbind(df, t(mat))
#   TeamHome TeamAway PTS_TeamHome PTS_TeamAway HOMETEAM_AVGCUMPTS ROADTEAM_AVGCUMPTS
# 1      LAL      IND          101           95              101.0           95.00000
# 2      HOU      LAL           87           89               87.0           95.00000
# 3      SAS      LAL           94          105               94.0           98.33333
# 4      MIA      HOU          110          111              110.0           99.00000
# 5      LAL      NOP           95          121               97.5          121.00000


回答4:

Transpose. Here's one way, riffing on the loop in @DavidArenburg's answer:

sv <- t(df[3:4])
tv <- t(df[1:2])
df[c("homeavg","awayavg")] <- t(ave(sv,tv,FUN=cummean))

cummean comes from library(dplyr); you can switch it out for the base R analog if desired; and similarly for the column names.


Or interleave. All the transposition above is hard to follow. Instead you could interleave the vectors, using Arun's approach:

interleave <- function(a,b) c(a,b)[order(c(seq_along(a), seq_along(b)))]
unleave    <- function(x) split(x,1:2)

sv2 <- interleave(df$PTS_TeamHome,df$PTS_TeamAway)
tv2 <- interleave(df$TeamHome,df$TeamAway)

df[c("homeavg","awayavg")] <- unleave(ave(sv2,tv2,FUN=cummean))