I would like to subset my data based on multiple inequality conditions using the data.table package. The examples in the data.table manual show how to do this with character variables, but not with numeric inequalities. I also see how to do this using the subset function. But I really would like to take advantage of the data.table binary search speed. Below is an example of what I am trying to do.
library(data.table)
data <- data.table(X=seq(-5,5,1), Y=seq(-5,5,1), Z=seq(-5,5,1))
data
setkey(data, X, Y, Z)
#the data.frame way
data[X > 0 & Y > 0 & Z > 0]
#the data.table way (does not work as I expected)
data[J(>0, >0, >0)]
The solution is quite fast and straightforward using the package dplyr.
install.packages(dplyr)
library(dplyr)
newdata <- filter(data, X > 0 , Y > 0 , Z > 0)
dplyr is showing to be one of the easiest and fastest packages for managing data frames. Check this great tutorial here: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
The RStudio team have alsoe produced a nice Cheat Sheet, here: http://www.rstudio.com/resources/cheatsheets/
I run some benchmarks
library(dplyr)
library(data.table)
library(microbenchmark)
dt.data.frame.way <- function(data) data[X > 0 & Y > 0 & Z > 0]
dplyr.way <- function(df) filter(df, X > 0, Y > 0, Z > 0)
real.data.frame.way <- function(df) df[df$X > 0 & df$Y > 0 & df$Z > 0,]
data <- data.table(X=seq(-5,5,1), Y=seq(-5,5,1), Z=seq(-5,5,1))
setkey(data, X, Y, Z)
df <- as.data.frame(data)
microbenchmark(times = 10,
dt.data.frame.way(data),
dplyr.way(df),
real.data.frame.way(df))
# Unit: microseconds
# expr min lq mean median uq max neval
# dt.data.frame.way(data) 710.426 754.287 871.8784 824.7565 942.998 1180.458 10
# dplyr.way(df) 951.309 1045.246 12303.3462 1142.7440 1246.668 112775.934 10
# real.data.frame.way(df) 137.239 162.591 181.5254 187.9785 197.373 231.594 10
Simple clone example data to 5.5M rows.
data <- data.table(X=seq(-5,5,1), Y=seq(-5,5,1), Z=seq(-5,5,1))
data <- rbindlist(lapply(1:5e5, function(i) data)) # 5500000 rows
setkey(data, X, Y, Z)
df <- as.data.frame(data)
microbenchmark(times = 10,
dt.data.frame.way(data),
dplyr.way(df),
real.data.frame.way(df))
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt.data.frame.way(data) 656.2978 668.0560 730.9246 696.6560 831.0877 846.0517 10
# dplyr.way(df) 632.4096 639.1141 709.4308 678.9436 717.3018 1015.7663 10
# real.data.frame.way(df) 964.4298 1022.1772 1075.8448 1077.4437 1125.0037 1192.7410 10
Performance of that task seems to be hard to improve. Often it depends on the data.