If I use dplyr syntax on top of a data.table, do I get all the speed benefits of data.table while still using the syntax of dplyr? In other words, do I misuse the data.table if I query it with dplyr syntax? Or do I need to use pure data.table syntax to harness all of its power?
Thanks in advance for any advice. Code Example:
library(data.table)
library(dplyr)
diamondsDT <- data.table(ggplot2::diamonds)
setkey(diamondsDT, cut)
diamondsDT %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(AvgPrice = mean(price),
            MedianPrice = as.numeric(median(price)),
            Count = n()) %>%
  arrange(desc(Count))
Results:
# cut AvgPrice MedianPrice Count
# 1 Ideal 3457.542 1810.0 21551
# 2 Premium 4584.258 3185.0 13791
# 3 Very Good 3981.760 2648.0 12082
# 4 Good 3928.864 3050.5 4906
Here is the data.table equivalent I came up with. Not sure if it complies with DT good practice. But I wonder if the code is really more efficient than the dplyr syntax behind the scenes:
diamondsDT[cut != "Fair"
  ][, .(AvgPrice = mean(price),
        MedianPrice = as.numeric(median(price)),
        Count = .N), by = cut
  ][order(-Count)]
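One quick sanity check, before worrying about speed, is that the two formulations return the same values. A minimal sketch (converting both results to plain data frames first, since the dplyr result is a tbl and the data.table result a data.table):

```r
library(data.table)
library(dplyr)

diamondsDT <- data.table(ggplot2::diamonds)

dplyr_res <- diamondsDT %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(AvgPrice = mean(price),
            MedianPrice = as.numeric(median(price)),
            Count = n()) %>%
  arrange(desc(Count))

dt_res <- diamondsDT[cut != "Fair"
  ][, .(AvgPrice = mean(price),
        MedianPrice = as.numeric(median(price)),
        Count = .N), by = cut
  ][order(-Count)]

# Compare values after dropping the tbl / data.table classes;
# should report TRUE if the two formulations agree
all.equal(as.data.frame(dplyr_res), as.data.frame(dt_res))
```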
There is no straightforward/simple answer because the philosophies of both these packages differ in certain aspects. So some compromises are unavoidable. Here are some of the concerns you may need to address/consider.
Operations involving i (== filter() and slice() in dplyr)

Assume DT with, say, 10 columns. Consider these data.table expressions:

DT[a > 1, .N]                      ## --- (1)
DT[a > 1, mean(b), by = .(c, d)]   ## --- (2)

(1) gives the number of rows in DT where column a > 1. (2) returns mean(b) grouped by c,d for the same expression in i as (1).

Commonly used dplyr expressions would be:

DT %>% filter(a > 1) %>% summarise(n())                          ## --- (3)
DT %>% filter(a > 1) %>% group_by(c, d) %>% summarise(mean(b))   ## --- (4)

Clearly, the data.table codes are shorter. In addition they are also more memory efficient (1). Why? Because in both (3) and (4), filter() returns rows for all 10 columns first, when in (3) we just need the number of rows, and in (4) we just need columns b, c, d for the successive operations. To overcome this, we have to select() columns a priori:

DT %>% select(a) %>% filter(a > 1) %>% summarise(n())                                   ## --- (5)
DT %>% select(a, b, c, d) %>% filter(a > 1) %>% group_by(c, d) %>% summarise(mean(b))   ## --- (6)

Note that in (5) and (6), we still subset column a, which we don't require. But I'm not sure how to avoid that. If filter() had an argument to select the columns to return, we could avoid this issue, but then the function would not do just one task (which is also a dplyr design choice).

Sub-assign by reference
For example, in data.table you can do:

DT[a %in% some_vals, a := NA]

which updates column a by reference on just those rows that satisfy the condition. At the moment dplyr deep copies the entire data.table internally to add a new column. @BrodieG already mentioned this in his answer.

But the deep copy can be replaced by a shallow copy when FR #617 is implemented. Also relevant: dplyr FR #614. Note that, even then, the column you modify will always be copied (and will therefore be a tad slower / less memory efficient). There will be no way to update columns by reference.
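To see the by-reference behaviour concretely, here is a minimal sketch with a made-up table, using data.table's address() to verify that no copy of the object is made:

```r
library(data.table)

# Made-up example table, for illustration only
DT <- data.table(a = c(2, 1, 3), b = c(10, 20, 30))

before <- address(DT)
DT[a > 1, b := NA_real_]   # sub-assign by reference: only the matching rows of b are touched
after <- address(DT)

identical(before, after)   # TRUE: DT was modified in place, no copy was made
```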
Other functionalities
In data.table, you can aggregate while joining, which is more straightforward to understand and memory efficient, since the intermediate join result is never materialised. Check this post for an example. You can't (at the moment?) do that using dplyr's data.table/data.frame syntax.
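As a concrete illustration, here is a minimal aggregate-while-joining sketch with made-up tables, using by = .EACHI to aggregate per row of the i table without materialising the full join:

```r
library(data.table)

# Made-up tables for illustration
sales  <- data.table(id = c(1, 1, 2, 2, 2, 3), amount = c(5, 7, 2, 4, 6, 9))
people <- data.table(id = c(1, 2))

# Join people into sales and aggregate per matched id in one step
agg <- sales[people, .(total = sum(amount)), on = "id", by = .EACHI]
agg
# id = 1 -> total 12; id = 2 -> total 12
```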
data.table's rolling joins feature is not supported in dplyr's syntax either.
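For reference, a minimal rolling-join sketch (made-up data): with roll = TRUE, each query time picks up the most recent preceding observation:

```r
library(data.table)

# Made-up price series and query times
prices  <- data.table(time = c(1, 5, 10), price = c(100, 105, 103))
queries <- data.table(time = c(4, 7, 12))
setkey(prices, time)

# Last observation carried forward: each query gets the latest price at or before its time
lookup <- prices[queries, roll = TRUE]
lookup
# prices rolled forward to the query times: 100, 105, 103
```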
We recently implemented overlap joins in data.table to join over interval ranges (here's an example). It is a separate function, foverlaps(), at the moment, and therefore could be used with the pipe operators (magrittr / pipeR? - never tried it myself). But ultimately, our goal is to integrate it into [.data.table so that we can harvest the other features, like grouping, aggregating while joining etc., which will have the same limitations outlined above.

Since 1.9.4, data.table implements automatic indexing using secondary keys for fast binary-search-based subsets with regular R syntax. For example, DT[x == 1] and DT[x %in% some_vals] will automatically create an index on the first run, which will then be used on successive subsets of the same column to subset quickly using binary search. This feature will continue to evolve. Check this gist for a short overview of this feature. From the way filter() is implemented for data.tables, it doesn't take advantage of this feature.

A dplyr feature is that it also provides an interface to databases using the same syntax, which data.table doesn't at the moment.
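To illustrate foverlaps(), a minimal sketch over made-up interval tables (y must be keyed by its interval columns):

```r
library(data.table)

# Made-up interval tables for illustration
x <- data.table(start = c(5, 31), end = c(10, 50))
y <- data.table(start = c(1, 15, 40), end = c(4, 35, 45), val = c("a", "b", "c"))
setkey(y, start, end)

# For each interval in x, find the overlapping intervals in y;
# non-matches are returned as NA rows by default
ov <- foverlaps(x, y, type = "any")
ov
```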
So, you will have to weigh these (and probably other) points and decide based on whether these trade-offs are acceptable to you.
HTH
(1) Note that being memory efficient directly impacts speed (especially as data gets larger), as the bottleneck in most cases is moving the data from main memory onto cache (and making use of data in cache as much as possible - reduce cache misses - so as to reduce accessing main memory). Not going into details here.
Just try it.
On this problem it seems data.table is 2.4x faster than dplyr using a data.table.
Revised based on Polymerase's comment.
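That comparison can be sketched along these lines (assuming the microbenchmark package is installed; exact timings will vary with hardware and package versions, so no numbers are claimed here):

```r
library(data.table)
library(dplyr)
library(microbenchmark)

diamondsDT <- data.table(ggplot2::diamonds)

bm <- microbenchmark(
  dplyr_on_dt = diamondsDT %>%
    filter(cut != "Fair") %>%
    group_by(cut) %>%
    summarize(AvgPrice = mean(price), Count = n()),
  pure_dt = diamondsDT[cut != "Fair",
                       .(AvgPrice = mean(price), Count = .N), by = cut],
  times = 10
)
bm
```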
To answer your questions:

- Yes, you are using data.table
- But it is not as efficient as pure data.table syntax

In many cases this will be an acceptable compromise for those wanting the dplyr syntax, though it will possibly be slower than dplyr with plain data frames.

One big factor appears to be that dplyr will copy the data.table by default when grouping. Consider timing the filtering and grouping steps separately (e.g. with microbenchmark): the filtering is of comparable speed, but the grouping isn't. I believe the culprit is the copying step in dplyr:::grouped_dt, where the copy argument defaults to TRUE (and can't easily be changed to FALSE that I can see). This likely doesn't account for 100% of the difference, but general overhead alone on something the size of diamonds most likely isn't the full difference.

The issue is that in order to have a consistent grammar, dplyr does the grouping in two steps: it first sets keys on a copy of the original data table that match the groups, and only later does it group. data.table just allocates memory for the largest result group, which in this case is just one row, so that makes a big difference in how much memory needs to be allocated.

FYI, if anyone cares, I found this by using treeprof (install_github("brodieg/treeprof")), an experimental (and still very much alpha) tree viewer for Rprof output.

Note that the above currently only works on Macs, AFAIK. Also, unfortunately, Rprof records calls of the type packagename::funname as anonymous, so it could actually be any and all of the data.table:: calls inside grouped_dt that are responsible, but from quick testing it looked like data.table::copy is the big one.

That said, you can quickly see how there isn't that much overhead around the [.data.table call, but there is also a completely separate branch for the grouping.

EDIT: to confirm the copying:
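One way to probe it (a sketch: data.table's address() reports where an object's data lives; whether the two addresses differ after group_by() depends on the dplyr version in use, so no particular outcome is claimed):

```r
library(data.table)
library(dplyr)

diamondsDT <- data.table(ggplot2::diamonds)

# Compare the address of a column before and after grouping;
# if they differ, grouping copied the underlying data
before <- address(diamondsDT$price)
grouped <- group_by(diamondsDT, cut)
after <- address(grouped$price)

before
after
```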