While assessing the utility of data.table
(vs. dplyr
), a critical factor is the ability to use it within functions and loops.
For this, I've modified the code snippet used in this post: data.table vs dplyr: can one do something well the other can't or does poorly? so that, instead of hard-coded dataset variables names ("cut" and "price" variables of "diamonds" dataset), it becomes dataset-agnostic - cut-n-paste ready for the use inside any function or a loop (when we don't know column names in advance).
This is the original code:
library(data.table)
dt <- data.table(ggplot2::diamonds)
dt[cut != "Fair", .(mean(price),.N), by = cut]
This is its dataset-agnostic equivalent:
dt <- data.table(diamonds)
nVarGroup <- 2 #"cut"
nVarMeans <- 7 #"price"
strGroupConditions <- levels(dt[[nVarGroup]])[-1] # "Good" "Very Good" "Premium" "Ideal"
strVarGroup <- names(dt)[nVarGroup]
strVarMeans <- names(dt)[nVarMeans]
qAction <- quote(mean(get(strVarMeans))) #! w/o get() it does not work!
qGroup <- quote(get(strVarGroup) %in% strGroupConditions) #! w/o get() it does not work!
dt[eval(qGroup), .(eval(qAction), .N), by = strVarGroup]
Note (Thanks to reply below): if you need to change variable value by reference, you need to use ()
, not get()
, as shown below:
strVarToBeReplaced <- names(dt)[1]
dt[eval(qGroup), (strVarToBeReplaced) := eval(qAction), by = strGroup][]
Now: you can cut-n-paste the following code for all your looping needs:
for(nVarGroup in 2:4) # Grouped by several categorical values...
for(nVarMeans in 5:10) { # ... get means of all numerical parameters
strGroupConditions <- levels(dt[[nVarGroup]])[-1]
strVarGroup <- names(dt)[nVarGroup]
strVarMeans <- names(dt)[nVarMeans]
qAction <- quote(mean(get(strVarMeans)))
qGroup <- quote(get(strVarGroup) %in% strGroupConditions)
p <- dt[eval(qGroup), .(AVE=eval(qAction), COUNT=.N), by = strVarGroup]
print(sprintf("nVaGroup=%s, nVarMeans=%s: ", strVarGroup, strVarMeans))
print(p)
}
My first question:
The code above, while enabling the required functional/looping needs, appears quite convoluted. - It uses different multiple (possibly non-consistent) non-intuitive tricks such combination of ()
, get()
, quote()
/eval()
, [[]]
). Seems too many for a such straightforward need...
Is there another better way of accessing and modifying data.tables values in loops? Perhaps with on=
, lapply
/.SD
/.SDcols
?
Please share your ideas below. This discussion aims to supplement and consolidate related bits from other posts (such as listed here: How can one work fully generically in data.table in R with column names in variables). Eventually, it would be great to create a dedicated vignette for using data.table
within functions
and loops
.
The second question:
Is dplyr easier for this purpose? - For this question however, I've set a separate post: Is dplyr easier than data.table to be used within functions and loops?.
The OP has asked for a dataset-agnostic equivalent for grouping and aggregating.
With development version 1.10.5,
data.table
has gained new Grouping Sets functions:rollup()
,cube()
, andgroupingsets()
which allow aggregation on various grouping levels at once producing sub-totals and grand total.The added level of abstraction can be used for a dataset-agnostic approach. The subtotals which are computed using a double nested
for
loop in OP's example can be produced as well bySo, we do not have to know the names of the columns. However, we have to specify which columns to group by and which columns to aggregate.
This might not be the most
data.table
-like or the fastest solution but I would streamline the code in this particular loop as follows:I've left the old code as comment for reference.
qAction
is replaced by usinglapply(.SD, mean)
together with the.SDcols
parameter.qGroup
for subsetting rows is replaced by the combination of setting a key and providing the vector of desired values asi
parameter.In case of a more complex subsetting expression I would try use non-equi (or conditional) joins using the
on=
syntax.Or, follow Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server".
Matt suggested to create a helper function
which can be combined with the "quasi-perl type string interpolation of
fn$
from thegsubfn
package to improve the readability of the EVAL solution" as suggested by G. Grothendieck.With this, the code for the loop becomes eventually:
Now, the
data.table
statement looks pretty much like a "native" statement except that$strVarGroup
and$strVarMeans
is used where the contents of variables is referenced.With version 1.1.0 (CRAN release on 2016-08-19), the
stringr
package has gained a string interpolation functionstr_interp()
which is an alternative to thegsubfn
package here.With
str_interp()
, the central statement in the for loop would becomeand the call to
library(gsubfn)
could be removed.