I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.
Here's the function I'm currently using to remove such variables from the data.frame. It's currently based on apply, and I was wondering if there are any obvious ways to speed this function up, so that it works quickly on very large datasets with a large number (400 or 500) of variables?
set.seed(1)
dat <- data.frame(
  A = factor(rep("X", 10), levels = c('X', 'Y')),
  B = round(runif(10) * 10),
  C = rep(10, 10),
  D = c(rep(10, 9), 1),
  E = factor(rep("A", 10)),
  F = factor(rep(c("I", "J"), 5)),
  G = c(rep(10, 9), NA)
)
zeroVar <- function(data, useNA = 'ifany') {
  # count the distinct values (optionally including NA) in each column;
  # a column with a single distinct value has zero variance
  out <- apply(data, 2, function(x) length(table(x, useNA = useNA)))
  which(out == 1)
}
And here's the result of the process:
> dat
A B C D E F G
1 X 3 10 10 A I 10
2 X 4 10 10 A J 10
3 X 6 10 10 A I 10
4 X 9 10 10 A J 10
5 X 2 10 10 A I 10
6 X 9 10 10 A J 10
7 X 9 10 10 A I 10
8 X 7 10 10 A J 10
9 X 6 10 10 A I 10
10 X 1 10 1 A J NA
> dat[,-zeroVar(dat)]
B D F G
1 3 10 I 10
2 4 10 J 10
3 6 10 I 10
4 9 10 J 10
5 2 10 I 10
6 9 10 J 10
7 9 10 I 10
8 7 10 J 10
9 6 10 I 10
10 1 1 J NA
> dat[,-zeroVar(dat, useNA = 'no')]
B D F
1 3 10 I
2 4 10 J
3 6 10 I
4 9 10 J
5 2 10 I
6 9 10 J
7 9 10 I
8 7 10 J
9 6 10 I
10 1 1 J
Don't use table() - it's very slow for such things. One option is length(unique(x)), which is an order of magnitude faster than yours on the example data set whilst giving similar output.
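A minimal sketch of that idea (the function name foo is illustrative):

foo <- function(dat) {
  # number of distinct values in each column; unique() counts NA as a value
  out <- vapply(dat, function(x) length(unique(x)), integer(1))
  # columns with a single distinct value have zero variance
  which(out == 1)
}

foo(dat)          # indices of the constant columns (A, C and E here)
dat[, -foo(dat)]  # drop them (guard against an empty result in general)

The speed claim can be checked with something like system.time(replicate(1000, zeroVar(dat))) against the same call wrapped around foo(dat).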
Simon's var()-based solution (further down) is similarly quick on this example, but you'll have to see if they scale similarly to real problem sizes.
Well, save yourself some coding time:
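A sketch of the variance step that produces that vector, assuming factor columns are recoded to integer codes and NAs are ignored (choices of mine, not necessarily what this answer originally used):

# per-column variance; factors are recoded so var() is defined for them,
# and na.rm = TRUE stops columns containing NA from yielding NA variances
bar <- sapply(dat, function(x) var(if (is.factor(x)) as.integer(x) else x, na.rm = TRUE))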
To avoid nasty floating-point round-offs, take that output vector, which I'll call "bar", and do something like
bar[bar < 2 * .Machine$double.eps] <- 0
and then finally your data frame dat[, as.logical(bar)] should do the trick.
Use the caret package and the function nearZeroVar()
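A minimal sketch of typical nearZeroVar() usage with caret's defaults (the object names are illustrative):

library(caret)

# one row of diagnostics per column, including zeroVar and nzv flags
nzv_stats <- nearZeroVar(dat, saveMetrics = TRUE)
nzv_stats

# indices of (near) zero-variance columns; guard against the empty case,
# since dat[, -integer(0)] would drop every column
drop_idx <- nearZeroVar(dat)
dat_clean <- if (length(drop_idx) > 0) dat[, -drop_idx] else dat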
You may also want to look into the nearZeroVar() function in the caret package. If you have one event out of 1000, it might be a good idea to discard these data (but this depends on the model). nearZeroVar() can do that.
Simply don't use table - it's extremely slow on numeric vectors since it converts them to strings. I would probably use something like the var()-based check sketched below. It will be TRUE for zero variance, NA for columns with NAs and FALSE for non-zero variance.
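A sketch of such a check, assuming factors are recoded to integer codes (var() is not defined for factors):

# TRUE  -> zero variance
# NA    -> the column contains NAs (var() without na.rm returns NA)
# FALSE -> non-zero variance
sapply(dat, function(x) var(if (is.factor(x)) as.integer(x) else x) == 0)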
I think having zero variance is equivalent to being constant, and one can get around without doing any arithmetic operations at all. I would expect that range() outperforms var(), but I have not verified this:
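A sketch of a range()-based check, assuming factors are recoded to integer codes (range() is not defined for unordered factors) and NAs are ignored:

isConstant <- function(x) {
  if (is.factor(x)) x <- as.integer(x)
  rng <- range(x, na.rm = TRUE)  # ignore NAs, like useNA = 'no' above
  rng[1] == rng[2]
}

sapply(dat, isConstant)
dat[, !sapply(dat, isConstant)]  # keeps B, D and F for the example data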