Elegant way to drop rare factor levels from data f

Posted 2019-04-06 18:18

Question:

I want to subset a dataframe by factor. I only want to retain factor levels above a certain frequency.

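# stringsAsFactors = TRUE is needed on R >= 4.0 for the column to be a factor
df <- data.frame(factor = c(rep("a",5),rep("b",5),rep("c",2)), variable = rnorm(12), stringsAsFactors = TRUE)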

This code creates the following data frame:

   factor    variable
1       a -1.55902013
2       a  0.22355431
3       a -1.52195456
4       a -0.32842689
5       a  0.85650212
6       b  0.00962240
7       b -0.06621508
8       b -1.41347823
9       b  0.08969098
10      b  1.31565582
11      c -1.26141417
12      c -0.33364069

I want to drop the factor levels that occur fewer than 5 times. I wrote a for-loop that works:

df.new <- df  # start from the full data so removals accumulate across levels
for (i in seq_along(levels(df$factor))) {
  if (table(df$factor)[i] < 5) {
    df.new <- df.new[df.new$factor != names(table(df$factor))[i], ]
  }
}

But is there a quicker, more elegant solution?

Answer 1:

What about

df.new <- df[!(as.numeric(df$factor) %in% which(table(df$factor)<5)),]
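
Note that subsetting rows does not remove the now-empty level from the factor itself. The following is a minimal sketch, assuming df$factor really is a factor (on R 4.0 and later that requires stringsAsFactors = TRUE, as below), that uses base R's droplevels() to discard the unused level after filtering. The seed is arbitrary and only there for reproducibility.

```r
set.seed(1)  # arbitrary seed, not part of the original question
df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12),
                 stringsAsFactors = TRUE)

# Integer codes of the levels that occur fewer than 5 times
rare <- which(table(df$factor) < 5)

# Keep only rows whose level code is not among the rare ones
df.new <- df[!(as.numeric(df$factor) %in% rare), ]

levels(df.new$factor)         # "c" is still listed, although no rows use it
df.new <- droplevels(df.new)  # drop levels that no longer have any rows
levels(df.new$factor)
```

Without the droplevels() step, functions such as table() or plotting routines would still report the rare level with a count of zero.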


Answer 2:

require(dplyr)

df %>% group_by(factor) %>% filter(n() >= 5)
#factor   variable
#1       a  2.0769363
#2       a  0.6187513
#3       a  0.2426108
#4       a -0.4279296
#5       a  0.2270024
#6       b -0.6839748
#7       b -0.3285610
#8       b  0.2625743
#9       b -0.9532957
#10      b  1.4526317
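
The result of this pipeline is still grouped by factor, which can affect later dplyr verbs. A minimal sketch, assuming a reasonably recent dplyr, that applies the same filter and then removes the grouping with ungroup():

```r
library(dplyr)

df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12))

res <- df %>%
  group_by(factor) %>%
  filter(n() >= 5) %>%  # n() is the number of rows in the current group
  ungroup()             # return an ungrouped tibble
```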


Answer 3:

library(data.table)
setDT(df)[, variable[.N >= 5], by = factor]

##    factor         V1
## 1:      a -0.8204684
## 2:      a  0.4874291
## 3:      a  0.7383247
## 4:      a  0.5757814
## 5:      a -0.3053884
## 6:      b  1.5117812
## 7:      b  0.3898432
## 8:      b -0.6212406
## 9:      b -2.2146999
## 10:     b  1.1249309
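
The expression above returns the values in a column named V1 rather than variable. If the original column names should be kept, one possible variant (a sketch, not the answerer's code) returns the per-group subset .SD only for groups that are large enough:

```r
library(data.table)

dt <- data.table(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12))

# .N is the group size; groups for which the expression yields NULL are dropped
res <- dt[, if (.N >= 5) .SD, by = "factor"]
```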


Answer 4:

Maybe join with a filtered count of the factors:

library(dplyr)
# %>% replaces the %.% operator, which has been removed from dplyr
common.factors <- df %>% group_by(factor) %>% tally() %>% filter(n >= 5)
df.1 <- semi_join(df, common.factors, by = "factor")


Answer 5:

Try this with base functions...

lvl = as.data.frame(table(df$factor))
colnames(lvl) = c('factor','count')
lvl
  factor count
1      a     5
2      b     5
3      c     2

df[df$factor %in% lvl[lvl$count>=5,]$factor,]
   factor    variable
1       a -0.01619026
2       a  0.94383621
3       a  0.82122120
4       a  0.59390132
5       a  0.91897737
6       b  0.78213630
7       b  0.07456498
8       b -1.98935170
9       b  0.61982575
10      b -0.05612874


Answer 6:

This worked for me:

df = df[df$factor %in% names(table(df$factor))[table(df$factor) >= 5], ]
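
If the same filter is needed for several columns or thresholds, the one-liner can be wrapped in a small function. This is only a sketch: drop_rare_levels and its arguments (data, col, min_n) are hypothetical names introduced here, not something from the answers above.

```r
# Hypothetical helper: keep rows whose value in `col` occurs at least `min_n` times
drop_rare_levels <- function(data, col, min_n) {
  counts <- table(data[[col]])                  # frequency of each level
  keep <- names(counts)[counts >= min_n]        # levels that are frequent enough
  out <- data[data[[col]] %in% keep, , drop = FALSE]
  droplevels(out)                               # remove unused factor levels, if any
}

df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12))
df.new <- drop_rare_levels(df, "factor", 5)
table(df.new$factor)
```

The function works whether the column is a factor or a character vector, because %in% compares the values as strings.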


Tags: r, subset