subsetting df with repeated sequences

2019-08-10 21:40发布

问题:

I have searched high and low for a solution to this, but I cannot find one.....

My dataframe (essentially a table of the no. 1 sports team by date) has numerous occasions where one or various teams would "reappear" in the data. I want to pull out the start (or end) date of each period at no. 1 per team.

An example of the data could be:

x1<- as.Date("2013-12-31")
adddate1 <- 1:length(teams1)
dates1 <- x1 + adddate1
teams2 <- c(rep("w", 3), rep("c", 8), rep("w", 4))
x2<- as.Date("2012-12-31")
adddate2 <- 1:length(teams2)
dates2 <- x2 + adddate2
dates <- c(dates2, dates1)
teams <- c(teams2, teams1)
df <- data.frame(dates, teams)
df$year <- year(df$dates)

which for 2013 looks like:

        dates teams year
1  2013-01-01     w 2013
2  2013-01-02     w 2013
3  2013-01-03     w 2013
4  2013-01-04     c 2013
5  2013-01-05     c 2013
6  2013-01-06     c 2013
7  2013-01-07     c 2013
8  2013-01-08     c 2013
9  2013-01-09     c 2013
10 2013-01-10     c 2013
11 2013-01-11     c 2013
12 2013-01-12     w 2013
13 2013-01-13     w 2013
14 2013-01-14     w 2013
15 2013-01-15     w 2013

However, using ddply aggregates the identically-named teams and returns the following:

split <- ddply(df, .(year, teams), head,1)
split <- split[order(split[,1]),]

       dates teams year
2 2013-01-01     w 2013
1 2013-01-04     c 2013
3 2014-01-01     c 2014
4 2014-01-09     k 2014

Is there a more elegant way to do this than creating a function which would go through the original df and return a unique value for each subset, add this to the df and then use ddply incorporating the new unique value to return what I want?

回答1:

You say some teams "reappear" and at that point I thought the little intergroup helper function from this answer might be just the right tool here. It is useful when in your case, there are teams e.g. "w" that reappear in the same year, e.g. 2013, after another team has been there for some time, e.g. "c". Now if you want to treat each sequence of occurence per team as separate groups in order to get the first or last date of that sequence, that when this function is useful. Note that if you only group by "team" and "year" as you would normally do, each team, e.g. "w" could only have one first/last date (for example when using "summarise" in dplyr).

Define the function:

intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}

Now group your data first by year and then additionally by using the intergroup function on the teams column:

library(dplyr)
df %>%
  group_by(year) %>%
  group_by(teamindex = intergroup(teams), add = TRUE) %>%
  filter(dense_rank(dates) == 1)

Finally, you can filter according to your needs. Here for example, I filter the min dates. The result would be:

#Source: local data frame [3 x 4]
#Groups: year, teamindex
#
#       dates teams year teamindex
#1 2013-01-01     w 2013         1
#2 2013-01-04     c 2013         2
#3 2013-01-12     w 2013         3

Note that team "w" reappears because we grouped by "teamindex" which we created by using intergroup function.

Another option to do the filtering is like this (using arrange and then slice):

df %>%
  group_by(year) %>%
  group_by(teamindex = intergroup(teams), add = TRUE) %>%
  arrange(dates) %>%
  slice(1)

The data I used is from akrun's answer.



回答2:

You could also use rle to create the teamindex.

library(dplyr)
 df %>% 
    group_by(year) %>% 
    group_by(teamindex= with(rle(teams),
          rep(seq_along(lengths), lengths)), add=TRUE) %>%
          filter(dates==min(dates)) #or #filter(dates==max(dates))

 #        dates teams year teamindex
 #1 2013-01-01     w 2013         1
 #2 2013-01-04     c 2013         2
 #3 2013-01-12     w 2013         3

Or

df %>% 
   group_by(year) %>%
   group_by(teamindex= with(rle(teams),
      rep(seq_along(lengths), lengths)), add=TRUE) %>%
   arrange(dates) %>%
   slice(n()) #or #slice(1)
 #       dates teams year teamindex
 #1 2013-01-03     w 2013         1
 #2 2013-01-11     c 2013         2
 #3 2013-01-15     w 2013         3

data

df <- structure(list(dates = structure(c(15706, 15707, 15708, 15709, 
15710, 15711, 15712, 15713, 15714, 15715, 15716, 15717, 15718, 
15719, 15720), class = "Date"), teams = c("w", "w", "w", "c", 
"c", "c", "c", "c", "c", "c", "c", "w", "w", "w", "w"), year = c(2013L, 
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 
2013L, 2013L, 2013L, 2013L, 2013L)), .Names = c("dates", "teams", 
"year"), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", 
"9", "10", "11", "12", "13", "14", "15"), class = "data.frame")


标签: r subset plyr