-->

Creating a running counting variable in R?

2019-06-15 09:37发布

问题:

I have a dataset of soccer match results, and I am hoping to learn R by creating a running set of ratings similar to the World Football Elo formula. I am running into trouble with things that seem to be simple in Excel aren't exactly intuitive in R. For instance, the first 15 of 4270 observations with the necessary variables:

       date t.1  t.2 m.result
1  19960406  DC   SJ      0.0
2  19960413 COL   KC      0.0
3  19960413  NE   TB      0.0
4  19960413 CLB   DC      1.0
5  19960413 LAG NYRB      1.0
6  19960414 FCD   SJ      0.5
7  19960418 FCD   KC      1.0
8  19960420  NE NYRB      1.0
9  19960420  DC  LAG      0.0
10 19960420 CLB   TB      0.0
11 19960421 COL  FCD      1.0
12 19960421  SJ   KC      0.5
13 19960427 CLB NYRB      1.0
14 19960427  DC   NE      0.5
15 19960428 FCD   TB      1.0

I want to be able to create a new variable that will be a running count of t.1 and t.2's total matches played (i.e., the instances up to the date in question that "DC" occurs in columns t.1 or t.2):

           date t.1  t.2 m.result  ##t.1m    ##t.2m
    1  19960406  DC   SJ      0.0       1         1
    2  19960413 COL   KC      0.0       1         1
    3  19960413  NE   TB      0.0       1         1
    4  19960413 CLB   DC      1.0       1         2
    5  19960413 LAG NYRB      1.0       1         1
    6  19960414 FCD   SJ      0.5       1         2
    7  19960418 FCD   KC      1.0       2         2
    8  19960420  NE NYRB      1.0       2         2
    9  19960420  DC  LAG      0.0       3         2
    10 19960420 CLB   TB      0.0       2         2
    11 19960421 COL  FCD      1.0       2         3
    12 19960421  SJ   KC      0.5       3         3
    13 19960427 CLB NYRB      1.0       3         3
    14 19960427  DC   NE      0.5       4         3
    15 19960428 FCD   TB      1.0       4         3

in Excel, this is a (relatively) simple =SUMPRODUCT equation, e.g:

E4=SUMPRODUCT((A:A<=A4)*(B:B=B4))+SUMPRODUCT((A:A<=A4)*(C:C=B4))

where E4 is t.1m for obs # 4, A:A is Date, B:B is t.1, C:C is t.2, etc.

But in R, I can get total sumproduct printed for me (i.e. "DC" has played 576 games across my dataset), but for some reason (probably that I'm new, impatient, rattled by trial and error) I'm just lost on how to make a running count on observation data, and especially how to make that running count into a variable, which is vital for any game rating index. I know 'PlayerRatings' exists, I feel that for my R education I should be able do this in the R suite without that package. plyr or dplyr is okay, of course.

For reference, here is my data for you to copy/paste into your R.

date<-c(19960406,19960413,19960413,19960413,19960413,19960414,19960418,19960420,19960420,19960420,19960421,19960421,19960427,19960427,19960428)
t.1<-c("DC","COL","NE","CLB","LAG","FCD","FCD","NE","DC","CLB","COL","SJ","CLB","DC","FCD")
t.2<-c("SJ","KC","TB","DC","NYRB","SJ","KC","NYRB","LAG","TB","FCD","KC","NYRB","NE","TB")
m.result<-c(0.0,0.0,0.0,1.0,1.0,0.5,1.0,1.0,0.0,0.0,1.0,0.5,1.0,0.5,1.0)
mtable<-data.frame(date,t.1,t.2,m.result)
mtable

回答1:

Here's a very straightforward solution that isn't pretty but does the job.

First, just a change to your data to make comparisons easier:

mtable<-data.frame(date,t.1,t.2,m.result, stringsAsFactors = FALSE)

Edited in:

If you want to assure the matches are ordered by date, you can use order as pointed out by @eipi10:

mtable = mtable[order(mtable$date), ]

Just note that in case the dates are in a format that the chronological order isn't the integer order, you can first convert them to Date format using as.Date().


What we are going to do is, for each row, take a subset of the dataframe with the columns t.1 and t.2, with all the rows from 1 to the said row. So 1:1, 1:2, 1:3, etc. At each run, we count the number of times that team has appeared, and use that as the result for the new column.

mtable$t.1m <- sapply(1:nrow(mtable),
             function(i) sum(mtable[1:i, c("t.1", "t.2")] == mtable$t.1[i]))

This was done for teams in t.1, with a small change on argument after ==we can make it for t.2:

mtable$t.2m <- sapply(1:nrow(mtable),
             function(i) sum(mtable[1:i, c("t.1", "t.2")] == mtable$t.2[i]))

Now our dataframe looks like this:

> mtable
       date t.1  t.2 m.result t.1m t.2m
1  19960406  DC   SJ      0.0    1    1
2  19960413 COL   KC      0.0    1    1
3  19960413  NE   TB      0.0    1    1
4  19960413 CLB   DC      1.0    1    2
5  19960413 LAG NYRB      1.0    1    1
6  19960414 FCD   SJ      0.5    1    2
7  19960418 FCD   KC      1.0    2    2
8  19960420  NE NYRB      1.0    2    2
9  19960420  DC  LAG      0.0    3    2
10 19960420 CLB   TB      0.0    2    2
11 19960421 COL  FCD      1.0    2    3
12 19960421  SJ   KC      0.5    3    3
13 19960427 CLB NYRB      1.0    3    3
14 19960427  DC   NE      0.5    4    3
15 19960428 FCD   TB      1.0    4    3


回答2:

In your data creation step, make sure stringsAsFactors = FALSE to avoid issues. Then it's easy to do. (edit: I made this an all dplyr example)

library(dplyr)

cross_count <- function(id, var) {
  length(which(mtable[id, var] == mtable[1:id, ] %>% select(t.1, t.2) %>% unlist))
}

mtable  %>% 
  arrange(date) %>% # This makes sure the dates are in order
  mutate(id = 1:nrow(.)) %>% 
  rowwise() %>% 
  mutate(t.1m = cross_count(id, 2), t.2m = cross_count(id, 3))




 date t.1  t.2 m.result id t.1m t.2m
1  19960406  DC   SJ      0.0  1    1    1
2  19960413 COL   KC      0.0  2    1    1
3  19960413  NE   TB      0.0  3    1    1
4  19960413 CLB   DC      1.0  4    1    2
5  19960413 LAG NYRB      1.0  5    1    1
6  19960414 FCD   SJ      0.5  6    1    2
7  19960418 FCD   KC      1.0  7    2    2
8  19960420  NE NYRB      1.0  8    2    2
9  19960420  DC  LAG      0.0  9    3    2
10 19960420 CLB   TB      0.0 10    2    2
11 19960421 COL  FCD      1.0 11    2    3
12 19960421  SJ   KC      0.5 12    3    3
13 19960427 CLB NYRB      1.0 13    3    3
14 19960427  DC   NE      0.5 14    4    3
15 19960428 FCD   TB      1.0 15    4    3


回答3:

It seems like the separate columns t.1m and t.2m are for book-keeping, and you're really just interested in the number of games played? I used with() to work with the columns of mtable without having to write mtable each time

mtable$games <- with(mtable, {

If a particular team plays, it plays as either team 1 or team 2

    played <- t.1 == "DC" | t.2 == "DC"

The comparison is vectorized, comparing each element of column t.1 with "DC", etc, and the logical comparison is also vectorized so a single |.

A tricky part of the data is that several teams play on a single day, and on the day of the match (apparently) only the focal team should be incremented. I accommodated this by figuring out how to order the games so that the focal team was always last in order on the day it played

    o <- order(date, played)

then calculated the cumulative sum of games played

    games <- cumsum(played[o])

and put the games back into their original order

    games[order(o)]
})

Here's the result

> head(mtable, 11)
       date t.1  t.2 m.result games
1  19960406  DC   SJ      0.0     1
2  19960413 COL   KC      0.0     1
3  19960413  NE   TB      0.0     1
4  19960413 CLB   DC      1.0     2
5  19960413 LAG NYRB      1.0     1
6  19960414 FCD   SJ      0.5     2
7  19960418 FCD   KC      1.0     2
8  19960420  NE NYRB      1.0     2
9  19960420  DC  LAG      0.0     3
10 19960420 CLB   TB      0.0     2
11 19960421 COL  FCD      1.0     3

Here's a function that implements this, allowing easy specification of the focal team

gamesplayed <- function(date, t1, t2, focal="DC") {
    played <- t1 == focal | t2 == focal
    o <- order(date, played)
    cumsum(played[o])[order(o)]
}


回答4:

To accomplish this using a similar fashion as you mentioned:

sum(mtable$t.1 == 'DC', mtable$t.2 == 'DC')