Let's say you have a data frame with two levels of factors that looks like this:
Factor1 Factor2 Value
A 1 0.75
A 1 0.34
A 2 1.21
A 2 0.75
A 2 0.53
B 1 0.42
B 2 0.21
B 2 0.18
B 2 1.42
etc.
How do I subset this data frame ("df", if you will) based on the condition that the combination of Factor1 and Factor2 (Fact1*Fact2) has more than, say, 2 observations? Can you use the length argument in subset to do this?
Assuming your data.frame is called mydf, you can use ave to create a logical vector to help subset:
mydf[with(mydf, as.logical(ave(Factor1, Factor1, Factor2,
FUN = function(x) length(x) > 2))), ]
# Factor1 Factor2 Value
# 3 A 2 1.21
# 4 A 2 0.75
# 5 A 2 0.53
# 7 B 2 0.21
# 8 B 2 0.18
# 9 B 2 1.42
Here's ave counting up your combinations. Notice that ave returns an object the same length as the number of rows in your data.frame (this makes it convenient for subsetting).
> with(mydf, ave(Factor1, Factor1, Factor2, FUN = length))
[1] "2" "2" "3" "3" "3" "1" "3" "3" "3"
The next step is to compare that length to your threshold. For that we need an anonymous function for our FUN argument.
> with(mydf, ave(Factor1, Factor1, Factor2, FUN = function(x) length(x) > 2))
[1] "FALSE" "FALSE" "TRUE" "TRUE" "TRUE" "FALSE" "TRUE" "TRUE" "TRUE"
Almost there... but since the first argument to ave was a character vector, the output is also a character vector. We wrap it in as.logical so we can use it directly for subsetting.
ave doesn't work this way on objects of class factor, so if Factor1 is a factor you'll need to do something like:
mydf[with(mydf, as.logical(ave(as.character(Factor1), Factor1, Factor2,
FUN = function(x) length(x) > 2))),]
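A variation on the same idea, offered as a sketch rather than as part of the answer above: if ave is given a numeric column such as Value as its first argument, the group sizes come back as numbers, so neither as.logical nor as.character is needed. The data frame below is rebuilt from the rows shown in the question; the names counts and result are only illustrative.

```r
# Sample data copied from the rows shown in the question
mydf <- data.frame(
  Factor1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42)
)

# ave() on a numeric vector returns a numeric vector of the same length,
# here holding the size of each row's Factor1/Factor2 group
counts <- ave(mydf$Value, mydf$Factor1, mydf$Factor2, FUN = length)

# Keep only the rows whose group has more than 2 observations
result <- mydf[counts > 2, ]
result
```

Because the counts stay numeric, the threshold comparison happens outside ave, which also makes it easy to change the cutoff later.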
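library(data.table)
dt = data.table(your_df)  # your_df is your data frame
# .N is the number of rows in each group; return the group's rows (.SD)
# only when the group has more than 2 observations
dt[, if (.N > 2) .SD, by = list(Factor1, Factor2)]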
# Factor1 Factor2 Value
#1: A 2 1.21
#2: A 2 0.75
#3: A 2 0.53
#4: B 2 0.21
#5: B 2 0.18
#6: B 2 1.42
You can use interaction and table to see the number of observations for each combination (mydata is your data), and then use %in% to subset the data.
mydata$inter <- with(mydata, interaction(Factor1, Factor2))
table(mydata$inter)
A.1 B.1 A.2 B.2
2 1 3 3
mydata[!mydata$inter %in% c("A.1","B.1"), ]
Factor1 Factor2 Value inter
3 A 2 1.21 A.2
4 A 2 0.75 A.2
5 A 2 0.53 A.2
7 B 2 0.21 B.2
8 B 2 0.18 B.2
9 B 2 1.42 B.2
Updated as per @Ananda's comment: you can use the following one-liner after creating the interaction variable.
mydata[mydata$inter %in% names(which(table(mydata$inter) > 2)), ]
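For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same interaction/table approach. It assumes the sample rows from the question and keeps the interaction in a separate vector (inter) instead of adding a column to the data; the names inter, counts, keep and result are only illustrative.

```r
# Sample data copied from the rows shown in the question
mydata <- data.frame(
  Factor1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42)
)

# One label per row, e.g. "A.1", "A.2", describing its factor combination
inter <- with(mydata, interaction(Factor1, Factor2))

# Count observations per combination and keep the labels with more than 2
counts <- table(inter)
keep <- names(which(counts > 2))

# Subset the rows whose combination is in the keep list
result <- mydata[inter %in% keep, ]
result
```

Keeping inter outside the data frame means the result has only the original three columns, so there is no helper column to drop afterwards.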