I have a large data.table
that I am collapsing to the month level using ,by
.
There are 5 by vars, with # of levels: c(4,3,106,3,1380)
. The 106 is months, the 1380 is a geographic unit. As in turns out there are some 0's, in that some cells have no values. by
drops these, but I'd like it to keep them.
Reproducible example:
require(data.table)
set.seed(1)
n <- 1000
s <- function(n,l=5) sample(letters[seq(l)],n,replace=TRUE)
dat <- data.table( x=runif(n), g1=s(n), g2=s(n), g3=s(n,25) )
datCollapsed <- dat[ , list(nv=.N), by=list(g1,g2,g3) ]
datCollapsed[ , prod(dim(table(g1,g2,g3))) ] # how many there should be: 5*5*25=625
nrow(datCollapsed) # how many there are
Is there an efficient way to fill in these missing values with 0's, so that all permutations of the by vars are in the resultant collapsed data.table?
Make a cartesian join of the unique values, and use that to join back to your results
Note that the missing values are NA right now, but you can easily change that to 0s if you want.
I'd also go with a cross-join, but would use it in the
i
-slot of the original call to[.data.table
:This approach is referred to as a "by-without-by" and, as documented in
?data.table
, it is just as efficient and fast as passing the grouping instructions in via theby
argument: