Dropping factors which levels have observations sm

Posted 2019-03-04 17:08

Question:

Suppose I have a data frame (df1) with factor columns:

factor1  factor2  factor3
-------  -------  -------
d        a         x
d        a         x
b        a         x
b        c         x
b        c         y
c        c         y
c        n         y
c        n         y
c        n         y

I want to drop every factor column in which at least one level has fewer than 3 observations.

In this data frame, factor1 has 3 levels (d, b and c). However, level d has a frequency of only 2, so I want to drop factor1 from the data frame.

The resulting data frame should be:

factor2  factor3
-------  -------
a         x
a         x
a         x
c         x
c         y
c         y
n         y
n         y
n         y

How can I do this in R? I would be very glad for any help. Thanks a lot.

Answer 1:

You could try using sapply and table:

df1[, sapply(c(1, 2, 3), function(x) min(table(df1[, x]))) >= 3]

and, a little more generic:

df1[, sapply(seq_along(df1), function(x) min(table(df1[, x]))) >= 3]
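
As a minimal sketch of the same idea on the question's data, assuming df1 is rebuilt by hand with factor() (the names min_counts and result are only illustrative):

```r
# Rebuild the question's data frame (values copied from the question)
df1 <- data.frame(
  factor1 = factor(c("d", "d", "b", "b", "b", "c", "c", "c", "c")),
  factor2 = factor(c("a", "a", "a", "c", "c", "c", "n", "n", "n")),
  factor3 = factor(c("x", "x", "x", "x", "y", "y", "y", "y", "y"))
)

# Smallest level frequency in each column
min_counts <- sapply(df1, function(x) min(table(x)))

# Keep columns whose rarest level has at least 3 observations;
# drop = FALSE keeps a data frame even if only one column survives
result <- df1[, min_counts >= 3, drop = FALSE]
names(result)  # "factor2" "factor3"
```

Here factor1 is dropped because level d occurs only twice.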


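Answer 2:

Is this what you want?

df <- data.frame(col1 = rep(letters[1:4], each = 3),
                 col2 = rep(letters[1:2], each = 6),
                 col3 = rep(letters[1:3], each = 4),
                 stringsAsFactors = TRUE)  # needed in R >= 4.0 to get factors

# Note: nlevels() counts distinct levels, not observations per level,
# so this keeps the columns with more than 2 levels (col1 and col3 here).
# For the per-level frequency asked about, test min(table(x)) > 2 instead.
df[, sapply(df, function(x) nlevels(x) > 2)]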


Answer 3:

We could use Filter:

Filter(function(x) nlevels(x) > 2, df1)

(based on the results in one of the upvoted posts; note that this tests the number of levels, not how often each level occurs)

Or, to test the frequency of each level:

Filter(function(x) min(tabulate(x)) > 2, df1)


Answer 4:

I would create a quick helper function that counts the observations at each level with a call to table(); look at table(df$fac1) to see how this works. Note that this isn't very robust, but it should get you started:

df <- data.frame(fac1 = factor(c("d", "d", "b", "b", "b", "c", "c", "c", "c")),
                 fac2 = factor(c("a", "a", "a", "c", "c", "c", "n", "n", "n")),
                 fac3 = factor(c(rep("x", 4), rep("y", 5))),
                 other = 1:9)

at_least_three_instances <- function(column) {
  # Non-factor columns are always kept
  if (!is.factor(column)) {
    return(TRUE)
  }
  # Keep a factor only if its rarest level has more than 2 observations
  min(table(column)) > 2
}

df[unlist(lapply(df, at_least_three_instances))]
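
One place where the helper above is not robust is unused factor levels: table() reports them with a count of 0, so a factor that was subset earlier would always be dropped. A possible variation, sketched under the assumption that unused levels should be ignored (keep_column and min_count are hypothetical names, not part of the answer):

```r
# Hypothetical helper: ignores unused levels and takes the threshold as a parameter
keep_column <- function(column, min_count = 3) {
  # Non-factor columns are always kept
  if (!is.factor(column)) return(TRUE)
  # Count only the levels that actually occur
  counts <- table(droplevels(column))
  length(counts) > 0 && min(counts) >= min_count
}

# Subsetting a factor leaves the unused level "d" behind
fac <- factor(c("d", "d", "b", "b", "b", "c", "c", "c", "c"))
sub <- fac[fac != "d"]

min(table(sub))   # 0, because "d" is still counted as an empty level
keep_column(sub)  # TRUE: b has 3 and c has 4 observations
keep_column(fac)  # FALSE: d has only 2 observations
```

It can be applied in the same way, for example df[sapply(df, keep_column)].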