How to get rows, by group, of data frame with earl

Posted 2020-02-07 10:17

Question:

df <- data.frame(group=c(1,2,4,2,1,4,2,3,3),
             ts=c("2014-02-13","2014-06-01","2014-02-14","2014-02-11","2013-02-01","2014-02-02","2014-03-21","2014-12-01","2014-02-11"),
             letter=letters[1:9])
df$ts <- as.Date(df$ts,format='%Y-%m-%d')

I want to find an operation that will produce the complete rows containing the minimum timestamp per group, in this case,

group         ts letter
    1 2013-02-01      e
    4 2014-02-02      f
    2 2014-02-11      d
    3 2014-02-11      i

A quick and dirty (and slow) base R solution would be

dfo <- data.frame(df[order(df$ts), ], index = seq_len(nrow(df)))
mins <- tapply(dfo$index,dfo$group,min)
dfo[dfo$index %in% mins,]

Intuitively, I think if there was a way to add an order index by group then I could just filter to where that column's value is 1, but I'm not sure how to execute it without lots of subsetting and rejoining.
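
The per-group order index described above can be sketched in base R with ave(), which applies a function within each group and returns a vector aligned with the original rows. This is only an illustration of the idea, not a tuned solution; the column name ord is an assumption introduced here, and ties are broken by row order.

```r
# Sample data from the question
df <- data.frame(group = c(1, 2, 4, 2, 1, 4, 2, 3, 3),
                 ts = as.Date(c("2014-02-13", "2014-06-01", "2014-02-14",
                                "2014-02-11", "2013-02-01", "2014-02-02",
                                "2014-03-21", "2014-12-01", "2014-02-11")),
                 letter = letters[1:9])

# Rank the timestamps within each group; ties.method = "first" guarantees
# exactly one row per group gets rank 1 (ord is a hypothetical column name)
df$ord <- ave(as.numeric(df$ts), df$group,
              FUN = function(x) rank(x, ties.method = "first"))

# Keep the rank-1 row of every group, then sort the result by timestamp
earliest <- df[df$ord == 1, ]
earliest <- earliest[order(earliest$ts), ]
```

No subsetting and rejoining is needed, because ave() writes the ranks back in the original row order.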

Answer 1:

You could use dplyr

library(dplyr)
group_by(df, group) %>% summarise(min = min(ts), letter = letter[which.min(ts)]) 
#   group        min letter
# 1     1 2013-02-01      e
# 2     2 2014-02-11      d
# 3     3 2014-02-11      i
# 4     4 2014-02-02      f

You could also slice the ranked rows

group_by(df, group) %>% 
    mutate(rank = row_number(ts)) %>% 
    arrange(rank) %>%
    slice(1)
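
If dplyr 1.0.0 or later is available (an assumption; the answer above predates it), the rank-then-slice steps can probably be collapsed into slice_min(). A minimal sketch on the question's data:

```r
library(dplyr)

# Sample data from the question
df <- data.frame(group = c(1, 2, 4, 2, 1, 4, 2, 3, 3),
                 ts = as.Date(c("2014-02-13", "2014-06-01", "2014-02-14",
                                "2014-02-11", "2013-02-01", "2014-02-02",
                                "2014-03-21", "2014-12-01", "2014-02-11")),
                 letter = letters[1:9])

earliest <- df %>%
  group_by(group) %>%
  # with_ties = FALSE keeps a single row even if a group has duplicate minima
  slice_min(ts, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  # sort by timestamp, as in the desired output of the question
  arrange(ts)
```

Unlike the summarise() version, this keeps every column of the original rows without naming them one by one.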


Answer 2:

Here's a data.table solution. You seem to want the result ordered by ts, not group. This does that.

library(data.table)
setDT(df)[,.SD[which.min(ts)],by=group][order(ts)]
#    group         ts letter
# 1:     1 2013-02-01      e
# 2:     4 2014-02-02      f
# 3:     2 2014-02-11      d
# 4:     3 2014-02-11      i
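
A variant worth knowing, sketched here rather than benchmarked: instead of building .SD for every group, collect the row numbers with .I and subset once. This is often reported to be faster on large tables. The sketch copies the data with as.data.table() so the original df is not modified by reference; the names dt and idx are introduced here for illustration.

```r
library(data.table)

# Sample data from the question
df <- data.frame(group = c(1, 2, 4, 2, 1, 4, 2, 3, 3),
                 ts = as.Date(c("2014-02-13", "2014-06-01", "2014-02-14",
                                "2014-02-11", "2013-02-01", "2014-02-02",
                                "2014-03-21", "2014-12-01", "2014-02-11")),
                 letter = letters[1:9])

dt <- as.data.table(df)

# .I holds row numbers of dt; per group, keep the row number of the earliest ts
idx <- dt[, .I[which.min(ts)], by = group]$V1

# Subset once by row number, then order by timestamp
earliest <- dt[idx][order(ts)]
```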


Answer 3:

Here's a one-liner using base R.

df[sapply(split(df,df$group), function(x) row.names(x)[which.min(x$ts)] ),]

Breaking it down some:

list.by.group <- split(df,df$group)
#a vector of the row names corresponding to the earliest date in each group
names.of.which.min <- sapply(list.by.group, function(x) row.names(x)[which.min(x$ts)])
#subset the data frame by row name
df[names.of.which.min,]
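
Another base R route, offered as an alternative sketch rather than part of the answer above: sort the whole data frame by ts, then drop every row whose group has already appeared. duplicated() flags all but the first occurrence of each group, and after sorting the first occurrence is the earliest one.

```r
# Sample data from the question
df <- data.frame(group = c(1, 2, 4, 2, 1, 4, 2, 3, 3),
                 ts = as.Date(c("2014-02-13", "2014-06-01", "2014-02-14",
                                "2014-02-11", "2013-02-01", "2014-02-02",
                                "2014-03-21", "2014-12-01", "2014-02-11")),
                 letter = letters[1:9])

# Sort all rows by timestamp, earliest first
sorted <- df[order(df$ts), ]

# Keep only the first row seen for each group
earliest <- sorted[!duplicated(sorted$group), ]
```

The result is already ordered by ts, so it should match the layout of the table in the question without a further sort.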


Tags: r dplyr