R - Keep first observation per group identified by

2019-08-30 07:46发布

问题:

So I currently face a problem in R that I exactly know how to deal with in Stata, but have wasted over two hours to accomplish in R.

Using the data.frame below, the result I want is to obtain exactly the first observation per group, while groups are formed by multiple variables and have to be sorted by another variable, i.e. the data.frame mydata obtained by:

id <- c(1,1,1,1,2,2,3,3,4,4,4)
day <- c(1,1,2,3,1,2,2,3,1,2,3)
value <- c(12,10,15,20,40,30,22,24,11,11,12)
mydata <- data.frame(id, day, value)

Should be transformed to:

   id day value   
   1   1    10 
   1   2    15 
   1   3    20 
   2   1    40 
   2   2    30 
   3   2    22 
   3   3    24 
   4   1    11 
   4   2    11 
   4   3    12 

By keeping only one of the rows with one or multiple duplicate group-identificators (here that is only row[1]: (id,day)=(1,1)), sorting for value first (so that the row with the lowest value is kept).

In Stata, this would simply be:

bys id day (value): keep if _n == 1

I found a piece of code on the web, which properly does that if I first produce a single group identifier :

mydata$id1 <- paste(mydata$id,"000",mydata$day, sep="")  ### the single group identifier

myid.uni <- unique(mydata$id1)
a<-length(myid.uni)

last <- c()

for (i in 1:a) {
  temp<-subset(mydata, id1==myid.uni[i])
  if (dim(temp)[1] > 1) {
    last.temp<-temp[dim(temp)[1],]
  }
  else {
    last.temp<-temp
  }
  last<-rbind(last, last.temp)
}

last

However, there are a few problems with this approach:
1. A single identifier needs to be created (which is quickly done).
2. It seems like a cumbersome piece of code compared to the single line of code in Stata.
3. On a medium-sized dataset (below 100,000 observations grouped in lots of about 6), this approach would take about 1.5 hours.

Is there any efficient equivalent to Stata's bys var1 var2: keep if _n == 1 ?

回答1:

I would order the data.frame at which point you can look into using by:

mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]

do.call(rbind, by(mydata, list(mydata$id, mydata$day), 
                  FUN=function(x) head(x, 1)))

Alternatively, look into the "data.table" package. Continuing with the ordered data.frame from above:

library(data.table)

DT <- data.table(mydata, key = "id,day")
DT[, head(.SD, 1), by = key(DT)]
#     id day value
#  1:  1   1    10
#  2:  1   2    15
#  3:  1   3    20
#  4:  2   1    40
#  5:  2   2    30
#  6:  3   2    22
#  7:  3   3    24
#  8:  4   1    11
#  9:  4   2    11
# 10:  4   3    12

Or, starting from scratch, you can use data.table in the following way:

DT <- data.table(id, day, value, key = "id,day")
DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]

And, by extension, in base R:

Ranks <- with(mydata, ave(value, id, day, FUN = function(x) 
  rank(x, ties.method="first")))
mydata[Ranks == 1, ]


回答2:

The package dplyr makes this kind of things easier.

library(dplyr)
mydata %>% group_by(id, day) %>% filter(row_number(value) == 1)

This command requires more memory in R than in Stata: rows are not suppressed in place, a new copy of the dataset is created.



回答3:

Using data.table, assuming the mydata object has already been sorted in the way you require, another approach would be:

library(data.table)
mydata <- data.table(my.data)
mydata <- mydata[, .SD[1], by = .(id, day)]

Using dplyr with magrittr pipes:

library(dplyr)
mydata <- mydata %>%
  group_by(id, day) %>%
  slice(1) %>%
  ungroup()

If you don't add ungroup() to the end dplyr's grouping structure will still be present and might mess up some of your subsequent functions.



标签: r stata