I am working with a data frame that deals with numeric measurements. Some individuals have been measured several times, both as juveniles and adults.
A reproducible example:
ID <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3")
age <- rep(c("juvenile", "adult"), each=5)
size <- rnorm(10)
# e.g. a1 is measured 3 times, twice as a juvenile, once as an adult.
d <- data.frame(ID, age, size)
My goal is to subset that data frame by selecting the IDs that appear at least once as a juvenile and at least once as an adult. I'm not sure how to do that.
The resulting dataframe would contain all measurements for individuals a1, a2 and a3, but would exclude a4, a5 and a6, as they were not measured at both stages.
A similar question was asked 7 months ago but never had an answer (Subset data frame to include only levels one factor that have values in both levels of another factor)
Thanks!
Here is one option with data.table
library(data.table)
setDT(d)[, .SD[all(c("juvenile", "adult") %in% age)], ID]
Or a base R option with ave
d[with(d, as.integer(ave(as.character(age), ID, FUN = function(x) length(unique(x)))) > 1), ]
# ID age size
#1 a1 juvenile -1.4545407
#2 a2 juvenile -0.4695317
#3 a3 juvenile 0.2271316
#5 a1 juvenile 0.2961210
#6 a2 adult -0.8331993
#9 a1 adult -0.6924967
#10 a3 adult -0.4619550
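As a variation on the base R idea, the per-ID check can also be done with a contingency table instead of ave. This is only a sketch: the set.seed(1) call and the names counts, both and res are my own additions for illustration, not part of the question or the answer above.

```r
# Rebuild the example data from the question.
# set.seed() is an added assumption so the sizes are reproducible.
set.seed(1)
ID   <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3")
age  <- rep(c("juvenile", "adult"), each = 5)
size <- rnorm(10)
d <- data.frame(ID, age, size)

# Rows are IDs, columns are age classes, cells are numbers of measurements.
counts <- table(d$ID, d$age)

# Keep the IDs that have at least one measurement in every age class.
both <- rownames(counts)[rowSums(counts > 0) == ncol(counts)]

# Subset the original data frame to those IDs.
res <- d[d$ID %in% both, ]
res
```

Because the test compares against ncol(counts), the same code should also cover data with more than two age classes, as long as "measured in every class" is what you want.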
With dplyr, you can use group_by %>% filter:
library(dplyr)
d %>% group_by(ID) %>% filter(all(c("juvenile", "adult") %in% age))
# A tibble: 7 x 3
# Groups: ID [3]
# ID age size
# <fctr> <fctr> <dbl>
#1 a1 juvenile -0.6947697
#2 a2 juvenile -0.3665272
#3 a3 juvenile 1.0293555
#4 a1 juvenile 0.2745224
#5 a2 adult 0.5299029
#6 a1 adult 2.2247802
#7 a3 adult -0.4717160
split by age, intersect and subset:
d[d$ID %in% Reduce(intersect, split(d$ID, d$age)),]
# ID age size
#1 a1 juvenile 1.44761836
#2 a2 juvenile 1.70098645
#3 a3 juvenile 0.08231986
#5 a1 juvenile 0.91240568
#6 a2 adult -1.77318962
#9 a1 adult 0.13597986
#10 a3 adult -1.18575294
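To make it easier to see what the one-liner does, here is the same idea unrolled into three steps. Treat it as a sketch: set.seed(1) and the intermediate names by_age, keep and res are assumptions I added for illustration.

```r
# Rebuild the example data from the question.
# set.seed() is an added assumption so the sizes are reproducible.
set.seed(1)
ID   <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3")
age  <- rep(c("juvenile", "adult"), each = 5)
size <- rnorm(10)
d <- data.frame(ID, age, size)

# Step 1: split the IDs by age class, giving one ID vector per class.
by_age <- split(as.character(d$ID), d$age)
# by_age$adult    : "a2" "a5" "a6" "a1" "a3"
# by_age$juvenile : "a1" "a2" "a3" "a4" "a1"

# Step 2: intersect the vectors pairwise across the whole list,
# leaving only the IDs that occur in every age class.
keep <- Reduce(intersect, by_age)

# Step 3: keep all rows whose ID survived the intersection.
res <- d[d$ID %in% keep, ]
res
```

Since Reduce folds intersect over the whole list, this should carry over unchanged to data with three or more age classes.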