I wish to (1) group data by one variable (State), (2) within each group find the row with the minimum value of another variable (Employees), and (3) extract the entire row.
(1) and (2) are easy one-liners, and I feel like (3) should be too, but I can't get it.
Here is a sample data set:
> data
  State Company Employees
1    AK       A        82
2    AK       B       104
3    AK       C        37
4    AK       D        24
5    RI       E        19
6    RI       F       118
7    RI       G        88
8    RI       H        42
data <- structure(list(State = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("AK", "RI"), class = "factor"), Company = structure(1:8, .Label = c("A",
"B", "C", "D", "E", "F", "G", "H"), class = "factor"), Employees = c(82L,
104L, 37L, 24L, 19L, 118L, 88L, 42L)), .Names = c("State", "Company",
"Employees"), class = "data.frame", row.names = c(NA, -8L))
Calculating the min by group is easy, using aggregate:
> aggregate(Employees ~ State, data, function(x) min(x))
  State Employees
1    AK        24
2    RI        19
...or data.table:
> library(data.table)
> DT <- data.table(data)
> DT[ , list(Employees = min(Employees)), by = State]
   State Employees
1:    AK        24
2:    RI        19
But how do I extract the entire row corresponding to these min values, i.e. also including Company in the result?
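For reference, one workable route is to merge the aggregated minima back onto the data. A minimal base R sketch (note that this also returns every row that ties for a group's minimum):
# compute the per-group minima, then merge them back onto the full data
mins <- aggregate(Employees ~ State, data, min)
merge(data, mins)
#   State Employees Company
# 1    AK        24       D
# 2    RI        19       E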
Slightly more elegant:
library(data.table)
DT[ , .SD[which.min(Employees)], by = State]
   State Company Employees
1:    AK       D        24
2:    RI       E        19
Slightly less elegant than using .SD, but a bit faster (for data with many groups); here .I returns the row index of each group's minimum in the original DT (in a column data.table names V1 by default), which is then used to subset DT:
DT[DT[ , .I[which.min(Employees)], by = State]$V1]
Also, just replace the expression which.min(Employees) with Employees == min(Employees) if your data set has multiple identical min values and you'd like to subset all of them.
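For example, a sketch of that variant on the sample data:
DT[ , .SD[Employees == min(Employees)], by = State] # keeps every row tying for the group minimum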
See also Subset by group with data.table.
A dplyr solution:
library(dplyr)
data %>%
  group_by(State) %>%
  slice(which.min(Employees))
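In more recent dplyr versions (1.0.0 and later), slice_min() should give the same result; note that it keeps all tied minima by default (with_ties = TRUE):
data %>%
  group_by(State) %>%
  slice_min(Employees, n = 1)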
As this is Google's top hit, I thought I would add some additional options which I find useful to know. The idea is basically to arrange once by Employees and then just take the unique rows per State.
Either using data.table
library(data.table)
unique(setDT(data)[order(Employees)], by = "State")
#    State Company Employees
# 1:    RI       E        19
# 2:    AK       D        24
Alternatively, we could also first order and then subset .SD. Both of these operations were optimized in recent data.table versions: order seemingly triggers data.table:::forderv, while .SD[1L] triggers GForce.
setDT(data)[order(Employees), .SD[1L], by = State, verbose = TRUE] # <- Added verbose
# order optimisation is on, i changed from 'order(...)' to 'forder(DT, ...)'.
# i clause present and columns used in by detected, only these subset: State
# Finding groups using forderv ... 0 sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
# Getting back original order ... 0 sec
# lapply optimization changed j from '.SD[1L]' to 'list(Company[1L], Employees[1L])'
# GForce optimized j to 'list(`g[`(Company, 1L), `g[`(Employees, 1L))'
# Making each group and running j (GForce TRUE) ... 0 secs
#    State Company Employees
# 1:    RI       E        19
# 2:    AK       D        24
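As these optimisations depend on the data.table version in use, it may be worth checking which version is installed:
packageVersion("data.table")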
Or dplyr
library(dplyr)
data %>%
  arrange(Employees) %>%
  distinct(State, .keep_all = TRUE)
#   State Company Employees
# 1    RI       E        19
# 2    AK       D        24
Another interesting idea, borrowed from @Khashaa's awesome answer (with a small modification in the form of mult = "first" in order to handle multiple matches), is to first find the minimum per group and then perform a binary join back. The advantage of this is both the utilization of data.table's gmin function (which skips the evaluation overhead) and the binary join feature.
tmp <- setDT(data)[, .(Employees = min(Employees)), by = State]
data[tmp, on = .(State, Employees), mult = \"first\"]
#    State Company Employees
# 1:    AK       D        24
# 2:    RI       E        19
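A keyed variant of the same join (it also appears in the benchmark below) might look as follows; keep in mind that setkey() physically reorders the table it is applied to:
tmp <- setDT(data)[, .(Employees = min(Employees)), by = State]
setkey(tmp, State, Employees)
setkey(data, State, Employees) # reorders data by State, Employees
data[tmp, mult = "first"]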
Some benchmarks
library(data.table)
library(dplyr)
library(plyr)
library(stringi)
library(microbenchmark)
set.seed(123)
N <- 1e6
data <- data.frame(State = stri_rand_strings(N, 2, '[A-Z]'),
                   Employees = sample(N*10, N, replace = TRUE))
DT <- copy(data)
setDT(DT)
DT2 <- copy(DT)
str(DT)
str(DT2)
microbenchmark("(data.table) .SD[which.min]: " = DT[ , .SD[which.min(Employees)], by = State],
               "(data.table) .I[which.min]: " = DT[DT[ , .I[which.min(Employees)], by = State]$V1],
               "(data.table) order/unique: " = unique(DT[order(Employees)], by = "State"),
               "(data.table) order/.SD[1L]: " = DT[order(Employees), .SD[1L], by = State],
               "(data.table) self join (on):" = {
                 tmp <- DT[, .(Employees = min(Employees)), by = State]
                 DT[tmp, on = .(State, Employees), mult = "first"]},
               "(data.table) self join (setkey):" = {
                 tmp <- DT2[, .(Employees = min(Employees)), by = State]
                 setkey(tmp, State, Employees)
                 setkey(DT2, State, Employees)
                 DT2[tmp, mult = "first"]},
               "(dplyr) slice(which.min): " = data %>% group_by(State) %>% slice(which.min(Employees)),
               "(dplyr) arrange/distinct: " = data %>% arrange(Employees) %>% distinct(State, .keep_all = TRUE),
               "(dplyr) arrange/group_by/slice: " = data %>% arrange(Employees) %>% group_by(State) %>% slice(1),
               "(plyr) ddply/which.min: " = ddply(data, .(State), function(x) x[which.min(x$Employees),]),
               "(base) by: " = do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ])))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# (data.table) .SD[which.min]: 119.66086 125.49202 145.57369 129.61172 152.02872 267.5713 100 d
# (data.table) .I[which.min]: 12.84948 13.66673 19.51432 13.97584 15.17900 109.5438 100 a
# (data.table) order/unique: 52.91915 54.63989 64.39212 59.15254 61.71133 177.1248 100 b
# (data.table) order/.SD[1L]: 51.41872 53.22794 58.17123 55.00228 59.00966 145.0341 100 b
# (data.table) self join (on): 44.37256 45.67364 50.32378 46.24578 50.69411 137.4724 100 b
# (data.table) self join (setkey): 14.30543 15.28924 18.63739 15.58667 16.01017 106.0069 100 a
# (dplyr) slice(which.min): 82.60453 83.64146 94.06307 84.82078 90.09772 186.0848 100 c
# (dplyr) arrange/distinct: 344.81603 360.09167 385.52661 379.55676 395.29463 491.3893 100 e
# (dplyr) arrange/group_by/slice: 367.95924 383.52719 414.99081 397.93646 425.92478 557.9553 100 f
# (plyr) ddply/which.min: 506.55354 530.22569 568.99493 552.65068 601.04582 727.9248 100 g
# (base) by: 1220.38286 1291.70601 1340.56985 1344.86291 1382.38067 1512.5377 100 h
The base function by is often useful for working with blocks of data in data.frames. For example:
by(data, data$State, function(x) x[which.min(x$Employees), ] )
It does return the data in a list, but you can collapse that with
do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ] ))
Corrected plyr solution:
library(plyr)
ddply(data, .(State), function(x) x[which.min(x$Employees), ])
#   State Company Employees
# 1    AK       D        24
# 2    RI       E        19
Thanks to @joel.wilson.