How do I make my for loop properly calculate means

Posted 2020-04-19 05:14

I have data on all the NCAA basketball games that have occurred since 2003. I am trying to implement a for loop that will calculate the average of a number of stats for each team at a given point in time. Here is my for loop:

library(data.table)

roll_season_team_stats <- NULL

for (i in 0:max(stats_DT$DayNum)) {
  stats <- stats_DT[DayNum < i]
  roll_stats <- dcast(stats_DT, TeamID+Season~.,fun=mean,na.rm=T,value.var = c('FGM', 'FGA', 'FGM3', 'FGA3', 'FTM', 'FTA', 'OR', 'DR', 'TO'))
  roll_stats$DayNum <- i + 1
  roll_season_team_stats <- rbind(roll_season_team_stats, roll_stats)
}

Here is the output from dput:

structure(list(Season = c(2003L, 2003L, 2003L, 2003L, 2003L, 
2003L, 2003L, 2003L, 2003L, 2003L), DayNum = c(10L, 10L, 11L, 
11L, 11L, 11L, 12L, 12L, 12L, 12L), TeamID = c(1104L, 1272L, 
1266L, 1296L, 1400L, 1458L, 1161L, 1186L, 1194L, 1458L), FGM = c(27L, 
26L, 24L, 18L, 30L, 26L, 23L, 28L, 28L, 32L), FGA = c(58L, 62L, 
58L, 38L, 61L, 57L, 55L, 62L, 58L, 67L), FGM3 = c(3L, 8L, 8L, 
3L, 6L, 6L, 2L, 4L, 5L, 5L), FGA3 = c(14L, 20L, 18L, 9L, 14L, 
12L, 8L, 14L, 11L, 17L), FTM = c(11L, 10L, 17L, 17L, 11L, 23L, 
32L, 15L, 10L, 15L), FTA = c(18L, 19L, 29L, 31L, 13L, 27L, 39L, 
21L, 18L, 19L), OR = c(14L, 15L, 17L, 6L, 17L, 12L, 13L, 13L, 
9L, 14L), DR = c(24L, 28L, 26L, 19L, 22L, 24L, 18L, 35L, 22L, 
22L), TO = c(23L, 13L, 10L, 12L, 14L, 9L, 17L, 19L, 17L, 6L)), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x102004ae0>)

The loop runs successfully but it is not producing the correct output. Rather than showing the team averages over time, it is giving me the same number (what I assume is the overall mean of each stat) for each day. Any ideas what is wrong with my loop? Thanks!

2 Answers
不美不萌又怎样
#2 · 2020-04-19 05:50

Avoid growing objects in a loop, which leads to excessive copying in memory. Instead, build a list of data frames and row-bind them once outside the loop.

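# Build one aggregated table per day; a step that errors returns NULL
dt_list <- lapply(0:max(stats_DT$DayNum), function(i)
  tryCatch(
    # aggregate only the games played before day i
    dcast(stats_DT[DayNum < i],
          TeamID + Season ~ ., fun.aggregate = mean, na.rm = TRUE,
          value.var = c('FGM', 'FGA', 'FGM3', 'FGA3',
                        'FTM', 'FTA', 'OR', 'DR', 'TO')
    )[, DayNum := i + 1],
    error = function(e) NULL)
)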

roll_season_team_stats <- data.table::rbindlist(dt_list)
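
As for why the original loop returned the same values for every day: it stores the filtered rows in stats but then passes the unfiltered stats_DT to dcast(). A minimal sketch of the difference, using made-up toy values and a plain grouped mean instead of dcast() for brevity (the names toy, unfiltered and filtered are only for illustration):

```r
library(data.table)

# Toy data (hypothetical values, one team, three games)
toy <- data.table(
  TeamID = c(1L, 1L, 1L),
  Season = 2003L,
  DayNum = c(10L, 11L, 12L),
  FGM    = c(27L, 24L, 23L)
)

i <- 12L

# What the original loop effectively did: the subset is ignored,
# so the mean always covers every game in the table
unfiltered <- toy[, .(FGM = mean(FGM)), by = .(TeamID, Season)]

# What was intended: only games played before day i
filtered <- toy[DayNum < i, .(FGM = mean(FGM)), by = .(TeamID, Season)]

print(unfiltered$FGM)  # mean of 27, 24, 23
print(filtered$FGM)    # mean of 27, 24
```

The first value does not depend on i at all, which matches the symptom described in the question.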

In fact, you may be able to do this in base R with aggregate on data frames:

stats_DF <- data.frame(stats_DT)

df_list <- lapply(0:max(stats_DT$DayNum), function(i)
              tryCatch(
                 transform(aggregate(cbind(FGM, FGA, FGM3, FGA3, 
                                           FTM, FTA, OR, DR, TO) ~ TeamID + Season, 
                                     stats_DF[stats_DF$DayNum < i,],
                                     FUN = mean,
                                     na.rm = TRUE),
                           DayNum = i + 1),
                       error = function(e) NULL)
           )    

roll_season_team_stats <- do.call(rbind, df_list)


走好不送
#3 · 2020-04-19 05:53

If I understand correctly, the OP wants to compute the cumulative mean of some variables for each team and season "showing the team averages over time".

Although the OP uses the term "roll" (e.g., roll_stats or roll_season_team_stats), the code suggests that the goal is not a rolling mean but cumulative means from the first DayNum onward, e.g.:

stats <- stats_DT[DayNum < i]

However, cumulative means can be calculated directly without creating the result piecewise in a for loop or by lapply() and combining the pieces afterwards.

Unfortunately, the sample dataset provided by the OP does contain rows for many different teams but no history, i.e., no data for the same team for a number of consecutive days. Therefore, I have modified the sample dataset for demonstration:

# create new sample data set
stats_DT2 <- copy(stats_DT)[, TeamID := c(1:2, 1:4, 1:4)][]
stats_DT2
    Season DayNum TeamID FGM FGA FGM3 FGA3 FTM FTA OR DR TO
 1:   2003     10      1  27  58    3   14  11  18 14 24 23
 2:   2003     10      2  26  62    8   20  10  19 15 28 13
 3:   2003     11      1  24  58    8   18  17  29 17 26 10
 4:   2003     11      2  18  38    3    9  17  31  6 19 12
 5:   2003     11      3  30  61    6   14  11  13 17 22 14
 6:   2003     11      4  26  57    6   12  23  27 12 24  9
 7:   2003     12      1  23  55    2    8  32  39 13 18 17
 8:   2003     12      2  28  62    4   14  15  21 13 35 19
 9:   2003     12      3  28  58    5   11  10  18  9 22 17
10:   2003     12      4  32  67    5   17  15  19 14 22  6

Now, as there are 2 to 3 rows for each team, the cumulative means can be calculated by:

# define function for cumulative mean
cummean <- function(x) cumsum(x) / seq_along(x)
# define variables to compute on
cols <- c('FGM', 'FGA', 'FGM3', 'FGA3', 'FTM', 'FTA', 'OR', 'DR', 'TO')
# compute aggregates 
stats_DT2[order(DayNum), c(.(DayNum = DayNum), lapply(.SD, cummean)), 
          .SDcols = cols, by = .(TeamID, Season)][]
    TeamID Season DayNum   FGM  FGA  FGM3  FGA3  FTM   FTA    OR    DR    TO
 1:      1   2003     10 27.00 58.0 3.000 14.00 11.0 18.00 14.00 24.00 23.00
 2:      1   2003     11 25.50 58.0 5.500 16.00 14.0 23.50 15.50 25.00 16.50
 3:      1   2003     12 24.67 57.0 4.333 13.33 20.0 28.67 14.67 22.67 16.67
 4:      2   2003     10 26.00 62.0 8.000 20.00 10.0 19.00 15.00 28.00 13.00
 5:      2   2003     11 22.00 50.0 5.500 14.50 13.5 25.00 10.50 23.50 12.50
 6:      2   2003     12 24.00 54.0 5.000 14.33 14.0 23.67 11.33 27.33 14.67
 7:      3   2003     11 30.00 61.0 6.000 14.00 11.0 13.00 17.00 22.00 14.00
 8:      3   2003     12 29.00 59.5 5.500 12.50 10.5 15.50 13.00 22.00 15.50
 9:      4   2003     11 26.00 57.0 6.000 12.00 23.0 27.00 12.00 24.00  9.00
10:      4   2003     12 29.00 62.0 5.500 14.50 19.0 23.00 13.00 23.00  7.50
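
As a quick sanity check of the helper, the FGM values of team 1 in the modified sample (27, 24, 23) can be run through cummean() on their own. This is only a small sketch; the vector name fgm_team1 is made up for the example:

```r
# Same helper as above: running sum divided by the running count
cummean <- function(x) cumsum(x) / seq_along(x)

# FGM of team 1 on days 10, 11 and 12 in the modified sample data
fgm_team1 <- c(27, 24, 23)

result <- round(cummean(fgm_team1), 2)
print(result)
# [1] 27.00 25.50 24.67
```

These are the FGM values shown for team 1 in rows 1 to 3 of the result.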

Alternatively, the cumulative means can be appended:

# append cumulative columns
stats_DT2[order(DayNum), paste0("cm_", cols) := lapply(.SD, cummean), 
          .SDcols = cols, by = .(TeamID, Season)][]
    Season DayNum TeamID FGM FGA FGM3 FGA3 FTM FTA OR DR TO cm_FGM cm_FGA cm_FGM3 cm_FGA3 cm_FTM cm_FTA cm_OR cm_DR cm_TO
 1:   2003     10      1  27  58    3   14  11  18 14 24 23  27.00   58.0   3.000   14.00   11.0  18.00 14.00 24.00 23.00
 2:   2003     10      2  26  62    8   20  10  19 15 28 13  26.00   62.0   8.000   20.00   10.0  19.00 15.00 28.00 13.00
 3:   2003     11      1  24  58    8   18  17  29 17 26 10  25.50   58.0   5.500   16.00   14.0  23.50 15.50 25.00 16.50
 4:   2003     11      2  18  38    3    9  17  31  6 19 12  22.00   50.0   5.500   14.50   13.5  25.00 10.50 23.50 12.50
 5:   2003     11      3  30  61    6   14  11  13 17 22 14  30.00   61.0   6.000   14.00   11.0  13.00 17.00 22.00 14.00
 6:   2003     11      4  26  57    6   12  23  27 12 24  9  26.00   57.0   6.000   12.00   23.0  27.00 12.00 24.00  9.00
 7:   2003     12      1  23  55    2    8  32  39 13 18 17  24.67   57.0   4.333   13.33   20.0  28.67 14.67 22.67 16.67
 8:   2003     12      2  28  62    4   14  15  21 13 35 19  24.00   54.0   5.000   14.33   14.0  23.67 11.33 27.33 14.67
 9:   2003     12      3  28  58    5   11  10  18  9 22 17  29.00   59.5   5.500   12.50   10.5  15.50 13.00 22.00 15.50
10:   2003     12      4  32  67    5   17  15  19 14 22  6  29.00   62.0   5.500   14.50   19.0  23.00 13.00 23.00  7.50
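
Note that the cumulative means above include the game played on the current DayNum, whereas the OP's filter DayNum < i only uses earlier games. If the means should exclude the current game, one possible variant is to lag the cumulative mean by one row per team with data.table's shift(). This is a sketch on made-up toy values, and the column name cm_FGM_prior is an assumption for illustration:

```r
library(data.table)

# Toy data (hypothetical values, not the OP's dataset)
toy <- data.table(
  TeamID = c(1L, 1L, 1L, 2L, 2L),
  Season = 2003L,
  DayNum = c(10L, 11L, 12L, 10L, 12L),
  FGM    = c(27L, 24L, 23L, 26L, 28L)
)

cummean <- function(x) cumsum(x) / seq_along(x)

# Sort so that rows follow game order within each team and season
setorder(toy, TeamID, Season, DayNum)

# Cumulative mean lagged by one game: each row only reflects
# games played before it; the first game of a team gets NA
toy[, cm_FGM_prior := shift(cummean(FGM)), by = .(TeamID, Season)]

print(toy)
```

The NA in each team's first row reflects that no earlier games exist to average, which may or may not be the desired convention.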