-->

Extract row corresponding to minimum value of a va

2018-12-31 12:31发布

问题:

I wish to (1) group data by one variable (State), (2) within each group find the row of minimum value of another variable (Employees), and (3) extract the entire row.

(1) and (2) are easy one-liners, and I feel like (3) should be too, but I can\'t get it.

Here is a sample data set:

> data
  State Company Employees
1    AK       A        82
2    AK       B       104
3    AK       C        37
4    AK       D        24
5    RI       E        19
6    RI       F       118
7    RI       G        88
8    RI       H        42

data <- structure(list(State = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
        2L), .Label = c(\"AK\", \"RI\"), class = \"factor\"), Company = structure(1:8, .Label = c(\"A\", 
        \"B\", \"C\", \"D\", \"E\", \"F\", \"G\", \"H\"), class = \"factor\"), Employees = c(82L, 
        104L, 37L, 24L, 19L, 118L, 88L, 42L)), .Names = c(\"State\", \"Company\", 
        \"Employees\"), class = \"data.frame\", row.names = c(NA, -8L))

Calculate min by group is easy, using aggregate:

> aggregate(Employees ~ State, data, function(x) min(x))
  State Employees
1    AK        24
2    RI        19

...or data.table:

> library(data.table)
> DT <- data.table(data)
> DT[ , list(Employees = min(Employees)), by = State]
   State Employees
1:    AK        24
2:    RI        19

But how do I extract the entire row corresponding to these min values, i.e. also including Company in the result?

回答1:

Slightly more elegant:

library(data.table)
DT[ , .SD[which.min(Employees)], by = State]

   State Company Employees
1:    AK       D        24
2:    RI       E        19

Slighly less elegant than using .SD, but a bit faster (for data with many groups):

DT[DT[ , .I[which.min(Employees)], by = State]$V1]

Also, just replace the expression which.min(Employees) with Employees == min(Employees), if your data set has multiple identical min values and you\'d like to subset all of them.

See also Subset by group with data.table.



回答2:

A dplyr solution :

library(dplyr)    
data %>% 
    group_by(State) %>% 
    slice(which.min(Employees))


回答3:

As this is Google\'s top hit, I thought I would add some additional options which I find useful to know. The idea is basically to arrange once by Employees and then just take the uniques per State

Either using data.table

library(data.table)
unique(setDT(data)[order(Employees)], by = \"State\")
#    State Company Employees
# 1:    RI       E        19
# 2:    AK       D        24

Alternatively, we could also first order and then subset .SD. Both of those operations were optimized in the resent data.table versions and order is seemingly triggers data.table:::forderv, while .SD[1L] triggers Gforce

setDT(data)[order(Employees), .SD[1L], by = State, verbose = TRUE] # <- Added verbose
# order optimisation is on, i changed from \'order(...)\' to \'forder(DT, ...)\'.
# i clause present and columns used in by detected, only these subset: State 
# Finding groups using forderv ... 0 sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
# Getting back original order ... 0 sec
# lapply optimization changed j from \'.SD[1L]\' to \'list(Company[1L], Employees[1L])\'
# GForce optimized j to \'list(`g[`(Company, 1L), `g[`(Employees, 1L))\'
# Making each group and running j (GForce TRUE) ... 0 secs
#    State Company Employees
# 1:    RI       E        19
# 2:    AK       D        24

Or dplyr

library(dplyr)
data %>% 
  arrange(Employees) %>% 
  distinct(State, .keep_all = TRUE)
#   State Company Employees
# 1    RI       E        19
# 2    AK       D        24

Another interesting idea borrowed from @Khashaas awesome answer (with a small modification in form of mult = \"first\" in order to handle multiple matches) is to first find minimum per group and then perform a binary join back. The advantage of this is both the utilization of data.tables gmin function (which skips the evaluation overhead) and the binary join feature

tmp <- setDT(data)[, .(Employees = min(Employees)), by = State]
data[tmp, on = .(State, Employees), mult = \"first\"]
#    State Company Employees
# 1:    AK       D        24
# 2:    RI       E        19

Some benchmarks

library(data.table)
library(dplyr)
library(plyr)
library(stringi)
library(microbenchmark)

set.seed(123)
N <- 1e6
data <- data.frame(State = stri_rand_strings(N, 2, \'[A-Z]\'),
                   Employees = sample(N*10, N, replace = TRUE))
DT <- copy(data)
setDT(DT)
DT2 <- copy(DT)
str(DT)
str(DT2)

microbenchmark(\"(data.table) .SD[which.min]: \" = DT[ , .SD[which.min(Employees)], by = State],
               \"(data.table) .I[which.min]: \" = DT[DT[ , .I[which.min(Employees)], by = State]$V1],
               \"(data.table) order/unique: \" = unique(DT[order(Employees)], by = \"State\"),
               \"(data.table) order/.SD[1L]: \" = DT[order(Employees), .SD[1L], by = State],
               \"(data.table) self join (on):\" = {
                 tmp <- DT[, .(Employees = min(Employees)), by = State]
                 DT[tmp, on = .(State, Employees), mult = \"first\"]},
               \"(data.table) self join (setkey):\" = {
                 tmp <- DT2[, .(Employees = min(Employees)), by = State] 
                 setkey(tmp, State, Employees)
                 setkey(DT2, State, Employees)
                 DT2[tmp, mult = \"first\"]},
               \"(dplyr) slice(which.min): \" = data %>% group_by(State) %>% slice(which.min(Employees)),
               \"(dplyr) arrange/distinct: \" = data %>% arrange(Employees) %>% distinct(State, .keep_all = TRUE),
               \"(dplyr) arrange/group_by/slice: \" = data %>% arrange(Employees) %>% group_by(State) %>% slice(1),
               \"(plyr) ddply/which.min: \" = ddply(data, .(State), function(x) x[which.min(x$Employees),]),
               \"(base) by: \" = do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ])))


# Unit: milliseconds
#                             expr        min         lq       mean     median         uq       max neval      cld
#    (data.table) .SD[which.min]:   119.66086  125.49202  145.57369  129.61172  152.02872  267.5713   100    d    
#     (data.table) .I[which.min]:    12.84948   13.66673   19.51432   13.97584   15.17900  109.5438   100 a       
#      (data.table) order/unique:    52.91915   54.63989   64.39212   59.15254   61.71133  177.1248   100  b      
#     (data.table) order/.SD[1L]:    51.41872   53.22794   58.17123   55.00228   59.00966  145.0341   100  b      
#     (data.table) self join (on):   44.37256   45.67364   50.32378   46.24578   50.69411  137.4724   100  b      
# (data.table) self join (setkey):   14.30543   15.28924   18.63739   15.58667   16.01017  106.0069   100 a       
#       (dplyr) slice(which.min):    82.60453   83.64146   94.06307   84.82078   90.09772  186.0848   100   c     
#       (dplyr) arrange/distinct:   344.81603  360.09167  385.52661  379.55676  395.29463  491.3893   100     e   
# (dplyr) arrange/group_by/slice:   367.95924  383.52719  414.99081  397.93646  425.92478  557.9553   100      f  
#         (plyr) ddply/which.min:   506.55354  530.22569  568.99493  552.65068  601.04582  727.9248   100       g 
#                      (base) by:  1220.38286 1291.70601 1340.56985 1344.86291 1382.38067 1512.5377   100        h


回答4:

The base function by is often useful for working with block data in data.frames. For example

by(data, data$State, function(x) x[which.min(x$Employees), ] )

It does return the data in a list, but you can collapse that with

do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ] ))


回答5:

Corrected plyr solution:

ddply(df, .(State), function(x) x[which.min(x$Employees),])
#   State Company Employees
# 1    AK       D        24
# 2    RI       E        19

thanks to @joel.wilson