I would like to subset rows of my data
library(data.table); set.seed(333); n <- 100
dat <- data.table(id=1:n, x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))
> head(dat)
id x y z
1: 1 109.3400 208.6732 308.7595
2: 2 101.6920 201.0989 310.1080
3: 3 119.4697 217.8550 313.9384
4: 4 111.4261 205.2945 317.3651
5: 5 100.4024 212.2826 305.1375
6: 6 114.4711 203.6988 319.4913
in several stages. I am aware that I could apply subset(.) sequentially to achieve this.
> s <- subset(dat, x>119)
> s <- subset(s, y>219)
> subset(s, z>315)
id x y z
1: 55 119.2634 219.0044 315.6556
My problem is that I need to automate this and it might happen that the subset is empty. In this case, I would want to skip the step(s) that result in an empty set. For example, if my data was
dat2 <- dat[1:50]
> s <-subset(dat2,x>119)
> s
id x y z
1: 3 119.4697 217.8550 313.9384
2: 50 119.2519 214.2517 318.8567
the second step subset(s, y>219) would come up empty, but I would still want to apply the third step subset(s, z>315). Is there a way to apply a subset-command only if it results in a non-empty set? I imagine something like subset(s, y>219, nonzero=TRUE). I would want to avoid constructions like
s <- dat
if(nrow(subset(s, x>119))>0){s <- subset(s, x>119)}
if(nrow(subset(s, y>219))>0){s <- subset(s, y>219)}
if(nrow(subset(s, z>315))>0){s <- subset(s, z>315)}
because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.). That's why I am hoping to find a solution optimized for speed.
PS. I only chose subset(.) for clarity; solutions with e.g. data.table would be just as welcome, if not more so.
An interesting approach could be developed using a modified filter function offered in dplyr. If the conditions are not met, the non_empty_filter function returns the original data set.
Notes
The function issues a warning when a filter is skipped. Of course, this can be removed and has no bearing on the function results.
Function
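A minimal sketch of what such a function could look like, assuming dplyr is installed. The name non_empty_filter comes from the text above; the body and the warning message are illustrative assumptions and not necessarily the original implementation.

```r
library(dplyr)

# Hypothetical implementation: wraps dplyr::filter() and falls back to the
# input when the filter would leave no rows.
non_empty_filter <- function(df, ...) {
  filtered <- filter(df, ...)
  if (nrow(filtered) == 0L) {
    # Report that the filter was skipped; drop this call if it is too noisy.
    warning("Filter would return no rows; returning the unfiltered data.",
            call. = FALSE)
    return(df)
  }
  filtered
}
```

Because the conditions are passed straight through to filter(), several conditions given in one call are combined with AND, so the fallback applies to the whole condition rather than to each part.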
Condition met
Behaviour: Returning one row for which the condition is met.
Results
Condition not met
Behaviour: Returning the full data set as the whole condition is not met due to y > 1e6.
Results
Condition met/not met one-by-one
Behaviour: Skipping filter that would return an empty data set.
Results
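To illustrate the one-by-one behaviour, here is a small sketch that chains the filters with the pipe. It assumes dplyr is installed and defines a hypothetical non_empty_filter helper inline so the snippet runs on its own. The two data rows are the ones the question shows as remaining after x > 119 in dat2.

```r
library(dplyr)

# Hypothetical helper, defined here so the snippet is self-contained.
non_empty_filter <- function(df, ...) {
  filtered <- filter(df, ...)
  if (nrow(filtered) == 0L) {
    warning("Filter skipped: no rows match.", call. = FALSE)
    return(df)
  }
  filtered
}

# The two rows left after x > 119 in the question's dat2 example.
s <- data.frame(id = c(3L, 50L),
                x  = c(119.4697, 119.2519),
                y  = c(217.8550, 214.2517),
                z  = c(313.9384, 318.8567))

res <- s %>%
  non_empty_filter(y > 219) %>%  # no row matches: skipped with a warning
  non_empty_filter(z > 315)      # keeps only the row with id 50
```

Applying each condition in its own call is what lets an empty step be skipped while later steps still run.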
I agree with Konrad's answer that this should throw a warning or at least report what happens somehow. Here's a data.table way that will take advantage of indices (see package vignettes for details):
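A sketch of how such a function might be written, assuming data.table is installed. The names f, x, mon and verbose follow the usage described in this answer; the body is a reconstruction under those assumptions and not necessarily the original code.

```r
library(data.table)

# Sketch of a skip-if-empty filter. `x` is a data.table and `...` are
# unquoted row conditions applied in order. `mon` records which conditions
# were skipped because they would have left no rows.
f <- function(x, ..., verbose = FALSE) {
  conds <- as.list(substitute(list(...)))[-1L]
  mon <- data.table(
    cond = vapply(conds, function(e) paste(deparse(e), collapse = " "),
                  character(1)),
    skip = rep(FALSE, length(conds))
  )
  env <- environment()
  for (k in seq_along(conds)) {
    # Build x[<condition>, verbose = <verbose>] so data.table sees the raw
    # expression in i and can use an index where one applies.
    qry <- substitute(x[cond, verbose = v],
                      list(cond = conds[[k]], v = verbose))
    d <- eval(qry, env)
    if (nrow(d) > 0L) x <- d else set(mon, k, "skip", TRUE)
  }
  print(mon)  # report which conditions were applied and which were skipped
  x
}
```

Printing mon stands in for the warning here: it shows every condition together with a flag saying whether it was skipped.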
Usage
The verbose option will print extra info provided by the data.table package, so you can see when indices are being used. For example, with f(dat, x == 119, verbose=TRUE), I see it.
If it's for non-interactive use, it may be better to have the function return list(mon = mon, x = x) to more easily keep track of what the query was and what happened. Also, the verbose console output could be captured and returned.
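For the non-interactive case, a variant along these lines could return everything in one list. This is only a sketch, assuming data.table is installed: the name f_log and the extra log element are hypothetical, and capture.output() collects only what data.table writes to standard output.

```r
library(data.table)

# Hypothetical non-interactive variant: returns the monitoring table, the
# filtered data and the captured verbose output instead of printing.
f_log <- function(x, ...) {
  conds <- as.list(substitute(list(...)))[-1L]
  mon <- data.table(
    cond = vapply(conds, function(e) paste(deparse(e), collapse = " "),
                  character(1)),
    skip = rep(FALSE, length(conds))
  )
  env <- environment()
  out <- character(0)
  for (k in seq_along(conds)) {
    qry <- substitute(x[cond, verbose = TRUE], list(cond = conds[[k]]))
    # Run the query and keep whatever it prints to stdout.
    out <- c(out, capture.output(d <- eval(qry, env)))
    if (nrow(d) > 0L) x <- d else set(mon, k, "skip", TRUE)
  }
  list(mon = mon, x = x, log = out)
}
```

When used with lapply(.) over a list of data.tables, each element of the result then carries its own record of which conditions were skipped.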