I want to subset a dataframe by factor. I only want to retain factor levels above a certain frequency.
df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)), variable = rnorm(12), stringsAsFactors = TRUE)
This code creates a data frame:
factor variable
1 a -1.55902013
2 a 0.22355431
3 a -1.52195456
4 a -0.32842689
5 a 0.85650212
6 b 0.00962240
7 b -0.06621508
8 b -1.41347823
9 b 0.08969098
10 b 1.31565582
11 c -1.26141417
12 c -0.33364069
I want to drop the factor levels that occur fewer than 5 times. I wrote a for-loop that works:
counts <- table(df$factor)
df.new <- df
for (i in seq_along(counts)) {
  # filter df.new (not df) so that every rare level is removed, not just the last one
  if (counts[i] < 5) {
    df.new <- df.new[df.new$factor != names(counts)[i], ]
  }
}
But do quicker and prettier solutions exist?
What about this?
df.new <- df[!(as.numeric(df$factor) %in% which(table(df$factor)<5)),]
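To see why this works, the one-liner can be unpacked step by step. This is only a minimal sketch, assuming `df$factor` is a true factor (since R 4.0, `data.frame()` no longer converts strings automatically, hence `stringsAsFactors = TRUE`). The names `counts`, `rare_codes` and `keep` are illustrative and not part of the answer above.

```r
set.seed(1)  # only to make the illustration reproducible
df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12),
                 stringsAsFactors = TRUE)

counts <- table(df$factor)        # frequency of each level, in level order
rare_codes <- which(counts < 5)   # integer positions of the rare levels
# A factor's integer codes are positions in levels(), so they can be
# compared directly with the positions returned by which().
keep <- !(as.integer(df$factor) %in% rare_codes)
df.new <- df[keep, ]

# Subsetting keeps "c" in levels(df.new$factor); droplevels() removes it.
df.new <- droplevels(df.new)
```

Note that the comparison relies on integer codes, so it would not work as written on a plain character column.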
library(dplyr)
df %>% group_by(factor) %>% filter(n() >= 5)
#factor variable
#1 a 2.0769363
#2 a 0.6187513
#3 a 0.2426108
#4 a -0.4279296
#5 a 0.2270024
#6 b -0.6839748
#7 b -0.3285610
#8 b 0.2625743
#9 b -0.9532957
#10 b 1.4526317
library(data.table)
setDT(df)[, variable[.N >= 5], by = factor]
## factor V1
## 1: a -0.8204684
## 2: a 0.4874291
## 3: a 0.7383247
## 4: a 0.5757814
## 5: a -0.3053884
## 6: b 1.5117812
## 7: b 0.3898432
## 8: b -0.6212406
## 9: b -2.2146999
## 10: b 1.1249309
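The expression above returns only `variable`, renamed to `V1`. If every column should be kept under its original name, a common data.table idiom is to return `.SD` conditionally. A sketch under the same example data; `dt` and `dt.new` are illustrative names:

```r
library(data.table)

set.seed(1)  # only to make the illustration reproducible
dt <- data.table(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12))

# .SD holds all non-grouping columns of the current group. Groups for which
# the `if` evaluates to NULL contribute no rows to the result.
dt.new <- dt[, if (.N >= 5) .SD, by = factor]
```

Unlike `setDT(df)`, which converts `df` in place, this builds a separate data.table and leaves the original object untouched.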
Maybe join with a filtered count of the factors:
library(dplyr)
# %.% was removed from dplyr; %>% is its replacement
common.factors <- df %>% group_by(factor) %>% tally() %>% filter(n >= 5)
df.1 <- semi_join(df, common.factors, by = "factor")
Try this with base functions...
lvl = as.data.frame(table(df$factor))
colnames(lvl) = c('factor','count')
lvl
factor count
1 a 5
2 b 5
3 c 2
df[df$factor %in% lvl[lvl$count>=5,]$factor,]
factor variable
1 a -0.01619026
2 a 0.94383621
3 a 0.82122120
4 a 0.59390132
5 a 0.91897737
6 b 0.78213630
7 b 0.07456498
8 b -1.98935170
9 b 0.61982575
10 b -0.05612874
This worked for me:
df <- df[df$factor %in% names(table(df$factor))[table(df$factor) >= 5], ]
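Another base R option computes each row's group size with `ave()`, which avoids calling `table()` twice and works for factor and character columns alike. This is a sketch rather than one of the posted answers; the helper `keep_frequent` and its arguments `column` and `min_n` are hypothetical names.

```r
keep_frequent <- function(data, column, min_n) {
  # For every row, count how many rows share its value in `column`
  group_size <- ave(seq_len(nrow(data)), data[[column]], FUN = length)
  out <- data[group_size >= min_n, , drop = FALSE]
  droplevels(out)  # discard factor levels that no longer occur
}

df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12))
df.new <- keep_frequent(df, "factor", 5)
table(df.new$factor)
```

Wrapping the filter in a function makes the threshold and the grouping column explicit, which helps when the same rule has to be applied to several data frames.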