Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.
However, I've never quite understood the differences between them -- how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be -- so I often just go through them all until I get what I want.
Can someone explain how to use which one when?
My current (probably incorrect/incomplete) understanding is...
sapply(vec, f): input is a vector. output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output
lapply(vec, f): same as sapply, but output is a list?
apply(matrix, 1/2, f): input is a matrix. output is a vector, where element i is f(row/col i of the matrix)
tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names
by(dataframe, grouping, f): let g be a grouping. apply f to each column of the group/dataframe. pretty print the grouping and the value of f at each column.
aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.
Side question: I still haven't learned plyr or reshape -- would plyr or reshape replace all of these entirely?
I recently discovered the rather useful sweep function and add it here for the sake of completeness:
sweep
The basic idea is to sweep through an array row- or column-wise and return a modified array. An example will make this clear (source: datacamp):
Let's say you have a matrix and want to standardize it column-wise:
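A minimal sketch of that column-wise standardization. The contents of dataPoints and the helper names col_means and col_sds are assumptions chosen for illustration; only the name dataPoints comes from the text.

```r
# Hypothetical 4 x 3 example matrix (values are an assumption)
dataPoints <- matrix(4:15, nrow = 4)

# Column means and standard deviations via apply()
col_means <- apply(dataPoints, 2, mean)
col_sds   <- apply(dataPoints, 2, sd)

# Sweep the means out of each column (centre), then the sds (scale)
centred      <- sweep(dataPoints, 2, col_means, "-")
standardized <- sweep(centred, 2, col_sds, "/")

print(standardized)
```

The second argument (2) says the statistics in the third argument line up with columns; using 1 would sweep along rows instead.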
NB: for this simple example the same result can of course be achieved more easily by apply(dataPoints, 2, scale)
From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy:
(Hopefully it's clear that apply corresponds to @Hadley's aaply and aggregate corresponds to @Hadley's ddply etc. Slide 20 of the same slideshare will clarify if you don't get it from this image.)
(on the left is input, on the top is output)
Since I realized that the (very excellent) answers in this post lack explanations of by and aggregate, here is my contribution.
BY
The by function, as stated in the documentation, can be thought of as a "wrapper" for tapply. The power of by arises when we want to compute a task that tapply can't handle. Start with a case both can handle: the same call made once with tapply and once with by, with the results stored as ct and cb.
If we print these two objects, ct and cb, we "essentially" have the same results; the only differences are in how they are shown and in their class attributes, by for cb and array for ct.
As I've said, the power of by arises when we can't use tapply. One example is asking tapply for the summary of the whole iris data frame grouped by Species. R says that arguments must have the same lengths. What we mean is "calculate the summary of every variable in iris along the factor Species", but R just can't do that because it does not know how to handle it.
With the by function, R dispatches a specific method for the data frame class and then lets the summary function work even if the length of the first argument (and the type too) is different.
It works indeed and the result is very surprising. It is an object of class by that, along Species (say, for each of its levels), computes the summary of each variable.
Note that if the first argument is a data frame, the dispatched function must have a method for that class of objects. For example, if we use the same call with the mean function, we get a result that makes no sense at all.
AGGREGATE
aggregate can be seen as another way of using tapply, if we use it in such a way.
The two immediate differences are that the second argument of aggregate must be a list while for tapply it can (but need not) be a list, and that the output of aggregate is a data frame while that of tapply is an array.
The power of aggregate is that it can easily handle subsets of the data with the subset argument, and that it has methods for ts objects and formula as well.
These elements make aggregate easier to work with than tapply in some situations; the documentation has examples.
We can achieve the same with tapply, but the syntax is slightly harder and the output (in some circumstances) less readable.
There are other times when we can't use by or tapply and we have to use aggregate, for instance to compute the mean of several variables along Month in a single call.
We cannot obtain that result with tapply in one call; we have to calculate the mean along Month for each element and then combine them (also note that we have to pass na.rm = TRUE, because the formula method of the aggregate function has na.action = na.omit by default).
With by we just can't achieve that; in fact, the corresponding function call returns an error (but most likely it is related to the supplied function, mean).
Other times the results are the same and the differences are just in the class of the object (and hence in how it is shown/printed and, for example, how to subset it).
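As a hedged illustration of that last point, here is one way to get the same per-month summaries in two different containers. The choice of the built-in airquality data and the names byagg and aggagg are assumptions for this sketch.

```r
# Same per-month summaries, stored in two different kinds of object
byagg  <- by(airquality, airquality$Month, summary)
aggagg <- aggregate(. ~ Month, data = airquality, FUN = summary)

class(byagg)   # "by"
class(aggagg)  # "data.frame"

# The data frame can be filtered like any other data frame
aggagg[aggagg$Month == 5, ]
```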
Both approaches achieve the same goal and results; at some point, which tool to use is just a matter of personal taste and needs. The two resulting objects behave very differently when it comes to subsetting.
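The tapply / by / aggregate comparisons made throughout this answer can be sketched roughly as follows. The data sets (iris, airquality) and all object names are assumptions for illustration, not necessarily the exact calls the answer had in mind.

```r
# 1. On a single vector, tapply() and by() compute the same thing
ct <- tapply(iris$Sepal.Width, iris$Species, summary)
cb <- by(iris$Sepal.Width, iris$Species, summary)
class(ct)  # "array"
class(cb)  # "by"

# 2. by() has a data frame method, so summary() sees each Species subset
bywork <- by(iris, iris$Species, summary)

# 3. aggregate() wants a list of groupings and returns a data frame
at <- tapply(iris$Sepal.Length, iris$Species, mean)
ag <- aggregate(iris$Sepal.Length, list(Species = iris$Species), mean)

# 4. Formula interface: several variables along Month in one call
ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, FUN = mean)

# The tapply() route needs one call per variable plus a cbind().
# Note: na.omit in the formula method drops whole rows with a missing
# Ozone value, so the Temp means need not match between the two routes.
ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta2 <- tapply(airquality$Temp,  airquality$Month, mean, na.rm = TRUE)
ta  <- cbind(Ozone = ta1, Temp = ta2)
```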
Despite all the great answers here, there are 2 more base functions that deserve to be mentioned: the useful outer function and the obscure eapply function.
outer
outer is a very useful function hidden as a more mundane one. If you read the help for outer, its description makes it seem like this is only useful for linear algebra type things. However, it can be used much like mapply to apply a function to two vectors of inputs. The difference is that mapply will apply the function to the first two elements and then the second two etc., whereas outer will apply the function to every combination of one element from the first vector and one from the second.
I have personally used this when I have a vector of values and a vector of conditions and wish to see which values meet which conditions.
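A small sketch of the mapply versus outer difference described above; the vectors A, B, values and thresholds are made-up inputs.

```r
A <- c(1, 3, 5, 7, 9)
B <- c(0, 3, 6, 9, 12)

# mapply pairs elements up: max(A[1], B[1]), max(A[2], B[2]), ...
pairwise <- mapply(max, A, B)   # 1 3 6 9 12

# outer tries every combination; the function must be vectorized,
# hence pmax rather than max
all_pairs <- outer(A, B, pmax)  # 5 x 5 matrix

# Values against conditions: which values exceed which thresholds?
values     <- c(2, 5, 8)
thresholds <- c(3, 6)
meets <- outer(values, thresholds, ">")  # 3 x 2 logical matrix
```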
eapply
eapply is like lapply except that rather than applying a function to every element in a list, it applies a function to every element in an environment, for example to find the user-defined functions in the global environment.
Frankly I don't use this very much but if you are building a lot of packages or create a lot of environments it may come in handy.
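A minimal sketch of eapply. It uses a fresh environment rather than the global one so the result is predictable; the objects x and square are hypothetical.

```r
# A small environment holding one data object and one function
env <- new.env()
assign("x", 1:10, envir = env)
assign("square", function(v) v^2, envir = env)

# Apply is.function to every object in the environment
is_fun <- eapply(env, is.function)

# Keep the names of the objects that are functions
fun_names <- names(Filter(isTRUE, is_fun))
print(fun_names)
```

Swapping env for .GlobalEnv would list the functions defined in your own session.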
First start with Joran's excellent answer -- doubtful anything can better that.
Then the following mnemonics may help to remember the distinctions between each. Whilst some are obvious, others may be less so --- for these you'll find justification in Joran's discussions.
Mnemonics
lapply is a list apply which acts on a list or vector and returns a list.
sapply is a simple lapply (function defaults to returning a vector or matrix when possible)
vapply is a verified apply (allows the return object type to be prespecified)
rapply is a recursive apply for nested lists, i.e. lists within lists
tapply is a tagged apply where the tags identify the subsets
apply is generic: applies a function to a matrix's rows or columns (or, more generally, to dimensions of an array)
Building the Right Background
If using the apply family still feels a bit alien to you, then it might be that you're missing a key point of view.
These two articles can help. They provide the necessary background to motivate the functional programming techniques that are being provided by the apply family of functions.
Users of Lisp will recognise the paradigm immediately. If you're not familiar with Lisp, once you get your head around FP, you'll have gained a powerful point of view for use in R -- and apply will make a lot more sense.
R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.
Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.
This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.
apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.
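A quick sketch of apply on a matrix and on a 3D array; M and A are made-up inputs.

```r
# A 4 x 4 matrix filled column by column
M <- matrix(1:16, nrow = 4, ncol = 4)

row_min <- apply(M, 1, min)  # one value per row:    1 2 3 4
col_max <- apply(M, 2, max)  # one value per column: 4 8 12 16

# Higher-dimensional analogue: a 4 x 4 x 2 array
A <- array(1:32, dim = c(4, 4, 2))
slab_sums <- apply(A, c(1, 2), sum)  # sums over the third dimension, 4 x 4
```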
If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
lapply - When you want to apply a function to each element of a list in turn and get a list back.
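For instance, with a made-up list x:

```r
x <- list(a = 1, b = 1:3, c = 10:100)

lens <- lapply(x, length)  # list(a = 1, b = 3, c = 91)
sums <- lapply(x, sum)     # list(a = 1, b = 6, c = 5005)

str(lens)
```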
This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.
sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
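A sketch using the same kind of made-up list, showing that sapply gives a vector where lapply would give a list:

```r
x <- list(a = 1, b = 1:3, c = 10:100)

lens_vec  <- sapply(x, length)          # named integer vector: a=1 b=3 c=91
lens_long <- unlist(lapply(x, length))  # the long-hand equivalent

print(lens_vec)
```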
If you find yourself typing unlist(lapply(...)), stop and consider sapply.
In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix.
If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector, unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array.
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
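The three simplification behaviours can be sketched like this (the functions passed in are arbitrary examples):

```r
# Vectors of equal length become the columns of a matrix
m1 <- sapply(1:5, function(i) c(i, i^2, i^3))
dim(m1)  # 3 5

# Each 2 x 2 matrix is flattened into one column of length 4
m2 <- sapply(1:5, function(i) matrix(i, 2, 2))
dim(m2)  # 4 5

# With simplify = "array" the matrices are stacked instead
a3 <- sapply(1:5, function(i) matrix(i, 2, 2), simplify = "array")
dim(a3)  # 2 2 5
```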
vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code.
For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.
mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.
This is multivariate in the sense that your function must accept multiple arguments.
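A hedged sketch of the vapply and mapply descriptions above, on made-up inputs:

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# vapply: like sapply, but FUN.VALUE states what each call must return
lens <- vapply(x, length, FUN.VALUE = integer(1))  # a=1 b=3 c=91

# A mismatch between FUN.VALUE and the actual result is an error
bad <- tryCatch(vapply(x, length, FUN.VALUE = character(1)),
                error = function(e) e)

# mapply: walk several inputs in parallel
reps <- mapply(rep, 1:4, 4:1)                   # list: 1 1 1 1 / 2 2 2 / 3 3 / 4
sums <- mapply(function(x, y) x + y, 1:3, 1:3)  # 2 4 6
```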
Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.
rapply - For when you want to apply a function to each element of a nested list structure, recursively.
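A small sketch of Map and rapply. The nested list l and the helper myFun are hypothetical, chosen only to show the recursion.

```r
# Map always returns a list, even when the results could be simplified
pair_sums <- Map(function(x, y) x + y, 1:3, 1:3)  # list(2, 4, 6)

# A user-defined function: add "!" to strings, add 1 to numbers
myFun <- function(x) {
  if (is.character(x)) paste(x, "!", sep = "") else x + 1
}

# A nested list to recurse through
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3,
          c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

flat   <- rapply(l, myFun)                   # default: results are unlisted
nested <- rapply(l, myFun, how = "replace")  # keeps the nested structure
```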
To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply.
tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.
The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.
Take a vector x and a factor y (of the same length!) defining groups, then add up the values in x within each subgroup defined by y.
More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors.
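A sketch of both cases. The vector x and factor y follow the names in the text; their values and the second factor z are assumptions.

```r
# A vector
x <- 1:20

# A factor (of the same length!) defining five groups of four
y <- factor(rep(letters[1:5], each = 4))

# Add up the values in x within each subgroup defined by y
group_sums <- tapply(x, y, sum)  # a=10 b=26 c=42 d=58 e=74

# Subgroups defined by combinations of two factors give a matrix
z <- factor(rep(c("u", "v"), times = 10))
cross_sums <- tapply(x, list(y, z), sum)  # 5 x 2
```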
tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.