I thought that, generally speaking, using `%>%` wouldn't have a noticeable effect on speed. But in this case it runs about 4x slower.
library(dplyr)
library(microbenchmark)
set.seed(0)
dummy_data <- dplyr::data_frame(
  id    = floor(runif(100000, 1, 100000)),
  label = floor(runif(100000, 1, 4))
)
microbenchmark(dummy_data %>% group_by(id) %>% summarise(list(unique(label))))
microbenchmark(dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list))
Without pipe:

      min       lq     mean   median       uq      max neval
 1.691441 1.739436 1.841157 1.812778 1.880713 2.495853   100

With pipe:

      min       lq     mean   median       uq      max neval
 6.753999 6.969573 7.167802 7.052744 7.195204 8.833322   100
Why is `%>%` so much slower in this situation? Is there a better way to write this?
But here is something I have learnt today. I am using R 3.5.0.
Code with x = 100 (1e2):

Although, if x = 1e6:
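The code blocks from this answer were lost, so here is a minimal sketch of the kind of comparison it describes, comparing a piped and an un-piped expression over a vector `x` of varying size. The exact expressions are my assumption, not the original code:

```r
# Minimal sketch (assumed expressions; the answer's original code was lost).
# Pipe overhead is paid per call, so it matters for tiny inputs and should
# wash out for large ones.
library(magrittr)
library(microbenchmark)

x <- runif(1e2)   # small input; also try x <- runif(1e6)
microbenchmark(
  plain = sort(unique(x)),
  piped = x %>% unique %>% sort
)
```

With `x <- runif(1e6)` the work inside `unique()` and `sort()` dominates, so the two timings should converge.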
So, I finally got around to running the expressions in OP's question:
This took so long that I thought I'd run into a bug, and force-interrupted R.
Trying again, with the number of repetitions cut down, I got the following times:
The times are in seconds! So much for milliseconds or microseconds. No wonder it seemed like R had hung at first, with the default value of `times = 100`.

But why is it taking so long? First, because of the way the dataset is constructed, the `id` column contains about 63000 distinct values:

Second, the expression being summarised over itself contains several pipes, and each set of grouped data is going to be relatively small.
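The distinct-id count mentioned above can be checked directly (the answer's original output block was lost; this is a reconstruction of the check):

```r
# With floor(runif(100000, 1, 100000)), roughly a (1 - 1/e) fraction of the
# possible ids are drawn at least once, so we expect ~63000 distinct values.
set.seed(0)
id <- floor(runif(100000, 1, 100000))
length(unique(id))
# roughly 63000 distinct ids, so summarise() evaluates the piped
# expression that many times, each over a tiny group
```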
This is essentially the worst-case scenario for a piped expression: it's being called very many times, and each time, it's operating over a very small set of inputs. This results in plenty of overhead, and not much computation for that overhead to be amortised over.
By contrast, if we just switch the variables that are being grouped and summarized over:
Now everything looks much more equal.
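The swapped benchmark referred to above was lost in extraction; here is a sketch of what it presumably looked like, grouping by `label` instead of `id` so that there are only 3 large groups (I use `tibble()` rather than the deprecated `data_frame()`):

```r
# Sketch of the swapped benchmark (assumed reconstruction): grouping by
# `label` gives only 3 groups of ~33000 rows each, so the per-call pipe
# overhead is amortised over much more work.
library(dplyr)
library(microbenchmark)

set.seed(0)
dummy_data <- tibble(
  id    = floor(runif(100000, 1, 100000)),
  label = floor(runif(100000, 1, 4))
)
microbenchmark(
  no_pipe = dummy_data %>% group_by(label) %>% summarise(list(unique(id))),
  pipe    = dummy_data %>% group_by(label) %>% summarise(id %>% unique %>% list),
  times = 20
)
```

With only 3 groups, the piped inner expression is evaluated 3 times rather than ~63000 times, so the pipe overhead becomes negligible relative to the actual work.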
What might be a negligible effect in a full real-world application becomes non-negligible when writing one-liners whose running time is dominated by that formerly "negligible" cost. I suspect that if you profile your tests, most of the time will be spent in the `summarize` clause, so let's microbenchmark something similar to that:

This is doing something a bit different from your code, but it illustrates the point. Pipes are slower.
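The benchmark code from this answer was lost; a sketch of something similar, timing just the inner summarise expression on a small, group-sized vector (the vector itself is my stand-in):

```r
# Sketch (assumed reconstruction): time the inner expression alone, with
# and without pipes, on an input about the size of one group.
library(magrittr)
library(microbenchmark)

x <- c(1, 2, 3, 1, 2)
microbenchmark(
  plain = list(unique(x)),
  piped = x %>% unique %>% list
)
```

Both forms return the same value; the difference in timing is pure pipe overhead.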
Because pipes need to restructure R's call into the form that an ordinary function evaluation uses, and then evaluate it. So it has to be slower. By how much depends on how fast the functions themselves are. Calls to `unique` and `list` are pretty fast in R, so the whole difference here is the pipe overhead.

Profiling expressions like this showed me that most of the time is spent in the pipe functions:
then somewhere down in about 15th place the real work gets done:
Whereas if you just call the functions as Chambers intended, R gets straight down to it:
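The profile listings themselves were lost in extraction; here is a sketch of how such a profile can be reproduced with `Rprof`. (Note that magrittr 2.0, released well after this answer was written, reimplemented `%>%` in C, so the overhead on a current installation is much smaller than discussed here.)

```r
# Profile a piped expression called many times over a tiny input (sketch;
# actual output varies by machine, R version and magrittr version).
library(magrittr)

x <- c(1, 2, 3, 1, 2)
Rprof(tmp <- tempfile(), interval = 0.001)
for (i in 1:1e5) x %>% unique %>% list
Rprof(NULL)
head(summaryRprof(tmp)$by.total)
# with older magrittr, "%>%" and its helpers sit near the top of this
# table, with unique() and list() much further down
```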
Hence the oft-quoted recommendation that pipes are fine at the command line, where your brain thinks in chains, but not in functions that might be time-critical. In practice this overhead will probably get wiped out by a single call to `glm` with a few hundred data points, but that's another story....