...regarding execution time and / or memory.
If this is not true, prove it with a code snippet. Note that speedup by vectorization does not count. The speedup must come from apply (tapply, sapply, ...) itself.
I've written elsewhere that an example like Shane's doesn't really stress the difference in performance among the various kinds of looping syntax, because all the time is spent inside the function rather than in the loop itself. Furthermore, the code unfairly compares a for loop that stores nothing with apply family functions that return a value. Here's a slightly different example that emphasizes the point.
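A minimal sketch of the kind of comparison described here, under stated assumptions: the size n and the helper names grow_with_for and build_with_lapply are illustrative and not taken from the original answer, and any timings quoted in this answer are the author's own, not results of this sketch. Both versions store their results and do almost no work per iteration, so the loop mechanics are what gets measured.

```r
n <- 1e4  # illustrative size (assumption, not the author's value)

# for loop that stores each result in a list as it goes
grow_with_for <- function(n) {
  z <- list()
  for (i in seq_len(n)) z[[i]] <- i * 2
  z
}

# lapply builds and returns the result list directly
build_with_lapply <- function(n) lapply(seq_len(n), function(i) i * 2)

t_for    <- system.time(res_for    <- grow_with_for(n))
t_lapply <- system.time(res_lapply <- build_with_lapply(n))

cat("for loop elapsed:", t_for[["elapsed"]], "s\n")
cat("lapply elapsed:  ", t_lapply[["elapsed"]], "s\n")

# Both approaches must produce the same list
same_result <- identical(res_for, res_lapply)
print(same_result)
```

Absolute timings depend on the machine and the R version, so repeat the runs several times before drawing conclusions.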
If you plan to save the result then apply family functions can be much more than syntactic sugar.
(The simple unlist of z takes only 0.2 s, so the lapply is much faster. Initializing z in the for loop is quite fast, and because I'm giving the average of the last 5 of 6 runs, moving it outside the system.time call would hardly affect things.)
One more thing to note, though, is that there is another reason to use apply family functions independent of their performance, clarity, or lack of side effects. A for loop typically promotes putting as much as possible within the loop. This is because each loop requires setup of variables to store information (among other possible operations). Apply statements tend to be biased the other way. Often you want to perform multiple operations on your data, several of which can be vectorized while some cannot. In R, unlike other languages, it is best to separate those operations out and run the ones that are not vectorized in an apply statement (or a vectorized version of the function) and the ones that are vectorized as true vector operations. This often speeds up performance tremendously.

Taking Joris Meys' example, where he replaces a traditional for loop with a handy R function, we can use it to show the efficiency of writing code in a more R-friendly manner, for a similar speedup without the specialized function.
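The separation described above can be sketched roughly as follows. This is an illustration in the spirit of the answer, not the author's original code: the data (x, y, z), their sizes, and the variable names are assumptions. Only the per-group mean runs inside vapply; combining the factors, splitting, and reshaping are all done as whole-vector operations.

```r
set.seed(1)  # reproducible illustrative data (assumption)
n <- 10000
x <- rnorm(n)
y <- factor(sample(letters[1:5],  n, replace = TRUE))
z <- factor(sample(letters[1:10], n, replace = TRUE))

yz <- interaction(y, z)   # vectorized: one combined grouping factor
xs <- split(x, yz)        # vectorized: split the data once

# The only step that runs once per group
m <- vapply(xs, mean, numeric(1))

# Vectorized reshape: levels of y vary fastest, so fill column by column
m <- matrix(m, nrow = nlevels(y),
            dimnames = list(levels(y), levels(z)))

# Reference result from the built-in function
ref <- tapply(x, list(y, z), mean)
print(all.equal(m, ref, check.attributes = FALSE))
```

The comparison with tapply is only there to check that the result is the same 5 x 10 matrix of group means.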
The vapply version winds up being much faster than the for loop and just a little slower than the built-in, optimized tapply function. It's not because vapply is so much faster than for, but because it is only performing one operation in each iteration of the loop. In this code everything else is vectorized. In Joris Meys' traditional for loop, many (7?) operations are occurring in each iteration and there's quite a bit of setup just for it to execute. Note also how much more compact this is than the for version.

The apply functions in R don't provide improved performance over other looping functions (e.g. for). One exception to this is lapply, which can be a little faster because it does more work in C code than in R (see this question for an example of this).

But in general, the rule is that you should use an apply function for clarity, not for performance.
I would add to this that apply functions have no side effects, which is an important distinction when it comes to functional programming with R. This can be overridden by using assign or <<-, but that can be very dangerous. Side effects also make a program harder to understand, since a variable's state depends on the history.

Edit:
Just to emphasize this with a trivial example that recursively calculates the Fibonacci sequence; this could be run multiple times to get an accurate measure, but the point is that none of the methods have significantly different performance:
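A sketch of such a comparison, assuming a naive recursive fib function and an illustrative input range (neither is taken from the original answer). Because the recursion dominates the run time, the choice of looping construct should make little difference.

```r
# Naive recursive Fibonacci: deliberately slow, so the function body dominates
fib <- function(n) if (n < 2) n else fib(n - 1) + fib(n - 2)

ns <- as.numeric(0:15)  # illustrative inputs (assumption)

# for loop with a preallocated result vector
res_for <- numeric(length(ns))
t_for <- system.time(for (i in seq_along(ns)) res_for[i] <- fib(ns[i]))

# sapply and lapply over the same inputs
t_sapply <- system.time(res_sapply <- sapply(ns, fib))
t_lapply <- system.time(res_lapply <- unlist(lapply(ns, fib)))

cat("for:   ", t_for[["elapsed"]], "s\n")
cat("sapply:", t_sapply[["elapsed"]], "s\n")
cat("lapply:", t_lapply[["elapsed"]], "s\n")

# All three constructs compute the same values
print(identical(res_for, res_sapply) && identical(res_for, res_lapply))
```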
Edit 2:
Regarding the usage of parallel packages for R (e.g. rpvm, Rmpi, snow), these do generally provide apply family functions (even the foreach package is essentially equivalent, despite the name). Here's a simple example of the sapply function in snow:

This example uses a socket cluster, for which no additional software needs to be installed; otherwise you will need something like PVM or MPI (see Tierney's clustering page).
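A socket-cluster call of the kind described above might look like the sketch below. As an assumption, it uses the parallel package that ships with R (which builds on snow) instead of snow itself, so that nothing extra has to be installed; the worker count and the squared-number task are purely illustrative.

```r
library(parallel)  # ships with R; used here in place of snow (assumption)

cl <- makeCluster(2)  # socket cluster with 2 local workers (illustrative count)

# Parallel counterpart of sapply: each worker evaluates the function
# on its share of the input
squares <- parSapply(cl, 1:10, function(x) x^2)

stopCluster(cl)  # always release the workers when done
print(squares)
```

For a task this small, the cost of starting the workers and shipping data far outweighs the computation; parallel apply functions pay off only when each call does substantial work.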
snow has the following apply functions:

It makes sense that apply functions should be used for parallel execution since they have no side effects. When you change a variable's value within a for loop, the change is made in the enclosing environment (globally, if the loop runs at top level). On the other hand, all apply functions can safely be used in parallel because changes are local to the function call (unless you try to use assign or <<-, in which case you can introduce side effects). Needless to say, it's critical to be careful about local vs. global variables, especially when dealing with parallel execution.

Edit:
Here's a trivial example to demonstrate the difference between for and *apply so far as side effects are concerned:

Note how the df in the parent environment is altered by for but not *apply.
When applying functions over subsets of a vector, tapply can be considerably faster than a for loop. Example:

apply, however, in most situations doesn't provide any speed increase, and in some cases can even be a lot slower:

But for these situations we've got colSums and rowSums:

Sometimes the speedup can be substantial, like when you have to nest for loops to get the average based on a grouping of more than one factor. Here you have two approaches that give the exact same result:
Both give exactly the same result, being a 5 x 10 matrix with the averages and named rows and columns. But:
There you go. What did I win? ;-)
...and as I just wrote elsewhere, vapply is your friend! ...it's like sapply, but you also specify the return value type which makes it much faster.
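A small sketch of what specifying the return type looks like in practice. The input list and the names below are illustrative assumptions; the point is only that vapply takes a template for each result (here numeric(1)) and checks every return value against it.

```r
x <- list(a = 1:3, b = 4:6)  # illustrative input (assumption)

# sapply guesses the shape of the result; vapply is told what to expect
by_sapply <- sapply(x, mean)
by_vapply <- vapply(x, mean, numeric(1))
print(identical(by_sapply, by_vapply))

# On empty input sapply falls back to a list, vapply keeps the declared type
empty_sapply <- sapply(list(), mean)
empty_vapply <- vapply(list(), mean, numeric(1))
print(class(empty_sapply))
print(class(empty_vapply))

# A result of the wrong type is an error instead of a silent coercion
wrong_type <- tryCatch(
  vapply(x, function(v) as.character(mean(v)), numeric(1)),
  error = function(e) "error"
)
print(wrong_type)
```

Whatever the speed difference turns out to be on a given workload, the declared return type also makes the result predictable, which matters when the code runs unattended.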