So we are used to say to every R new user that "apply
isn't vectorized, check out the Patrick Burns R Inferno Circle 4" which says (I quote):
A common reflex is to use a function in the apply family. This is not vectorization, it is loop-hiding. The apply function has a for loop in its definition. The lapply function buries the loop, but execution times tend to be roughly equal to an explicit for loop.
Indeed, a quick look on the apply
source code reveals the loop:
grep("for", capture.output(getAnywhere("apply")), value = TRUE)
## [1] " for (i in 1L:d2) {" " else for (i in 1L:d2) {"
Ok so far, but a look at lapply
or vapply
actually reveals a completely different picture:
lapply
## function (X, FUN, ...)
## {
## FUN <- match.fun(FUN)
## if (!is.vector(X) || is.object(X))
## X <- as.list(X)
## .Internal(lapply(X, FUN))
## }
## <bytecode: 0x000000000284b618>
## <environment: namespace:base>
So apparently there is no R for
loop hiding there, rather they are calling internal C written function.
A quick look in the rabbit hole reveals pretty much the same picture
Moreover, let's take the colMeans
function for example, which was never accused in not being vectorised
colMeans
# function (x, na.rm = FALSE, dims = 1L)
# {
# if (is.data.frame(x))
# x <- as.matrix(x)
# if (!is.array(x) || length(dn <- dim(x)) < 2L)
# stop("'x' must be an array of at least two dimensions")
# if (dims < 1L || dims > length(dn) - 1L)
# stop("invalid 'dims'")
# n <- prod(dn[1L:dims])
# dn <- dn[-(1L:dims)]
# z <- if (is.complex(x))
# .Internal(colMeans(Re(x), n, prod(dn), na.rm)) + (0+1i) *
# .Internal(colMeans(Im(x), n, prod(dn), na.rm))
# else .Internal(colMeans(x, n, prod(dn), na.rm))
# if (length(dn) > 1L) {
# dim(z) <- dn
# dimnames(z) <- dimnames(x)[-(1L:dims)]
# }
# else names(z) <- dimnames(x)[[dims + 1]]
# z
# }
# <bytecode: 0x0000000008f89d20>
# <environment: namespace:base>
Huh? It also just calls .Internal(colMeans(...
which we can also find in the rabbit hole. So how is this different from .Internal(lapply(..
?
Actually a quick benchmark reveals that sapply
performs no worse than colMeans
and much better than a for
loop for a big data set
m <- as.data.frame(matrix(1:1e7, ncol = 1e5))
system.time(colMeans(m))
# user system elapsed
# 1.69 0.03 1.73
system.time(sapply(m, mean))
# user system elapsed
# 1.50 0.03 1.60
system.time(apply(m, 2, mean))
# user system elapsed
# 3.84 0.03 3.90
system.time(for(i in 1:ncol(m)) mean(m[, i]))
# user system elapsed
# 13.78 0.01 13.93
In other words, is it correct to say that lapply
and vapply
are actually vectorised (compared to apply
which is a for
loop that also calls lapply
) and what did Patrick Burns really mean to say?
So to sum the great answers/comments up into some general answer and provide some background: R has 4 types of loops (in from not-vectorized to vectorized order)
for
loop that repeatedly calls R functions in each iterations (Not vectorised)So the
*apply
family is the second type. Exceptapply
which is more of the first typeYou can understand this from the comment in its source code
That means that
lapply
s C code accepts an unevaluated function from R and later evaluates it within the C code itself. This is basically the difference betweenlapply
s.Internal
callWhich has a
FUN
argument that holds an R functionAnd the
colMeans
.Internal
call which does not have aFUN
argumentcolMeans
, unlikelapply
knows exactly what function it needs to use, thus it calculates the mean internally within the C code.You can clearly see the evaluation process of the R function in each iteration within
lapply
C codeTo sum things up,
lapply
is not vectorized, though it has two possible advantages over the plain Rfor
loopAccessing and assigning in a loop seems to be faster in C (i.e. in
lapply
ing a function) Although the difference seems big, we, still, stay at the microsecond level and the costly thing is the valuation of an R function in each iteration. A simple example:As mentioned by @Roland, it runs a compiled C loop rather an interpreted R loop
Though when vectorizing your code, there are some things you need to take into account.
df
) is of classdata.frame
, some vectorized functions (such ascolMeans
,colSums
,rowSums
, etc.) will have to convert it to a matrix first, simply because this is how they were designed. This means that for a bigdf
this can create a huge overhead. Whilelapply
won't have to do this as it extracts the actual vectors out ofdf
(asdata.frame
is just a list of vectors) and thus, if you have not so many columns but many rows,lapply(df, mean)
can sometimes be better option thancolMeans(df)
..Primitive
, and generic (S3
,S4
) see here for some additional information. The generic function have to do a method dispatch which sometimes a costly operation. For example,mean
is genericS3
function whilesum
isPrimitive
. Thus some timeslapply(df, sum)
could be very efficient comparedcolSums
from the reasons listed aboveI agree with Patrick Burns' view that it is rather loop hiding and not code vectorisation. Here's why:
Consider this
C
code snippet:What we would like to do is quite clear. But how the task is performed or how it could be performed isn't really. A for-loop by default is a serial construct. It doesn't inform if or how things can be done in parallel.
The most obvious way is that the code is run in a sequential manner. Load
a[i]
andb[i]
on to registers, add them, store the result inc[i]
, and do this for eachi
.However, modern processors have vector or SIMD instruction set which is capable of operating on a vector of data during the same instruction when performing the same operation (e.g., adding two vectors as shown above). Depending on the processor/architecture, it might be possible to add, say, four numbers from
a
andb
under the same instruction, instead of one at a time.It'd be great if the compiler identifies such blocks of code and automatically vectorises them, which is a difficult task. Automatic code vectorisation is a challenging research topic in Computer Science. But over time, compilers have gotten better at it. You can check the auto vectorisation capabilities of
GNU-gcc
here. Similarly forLLVM-clang
here. And you can also find some benchmarks in the last link compared againstgcc
andICC
(Intel C++ compiler).gcc
(I'm onv4.9
) for example doesn't vectorise code automatically at-O2
level optimisation. So if we were to execute the code shown above, it'd be run sequentially. Here's the timing for adding two integer vectors of length 500 million.We either need to add the flag
-ftree-vectorize
or change optimisation to level-O3
. (Note that-O3
performs other additional optimisations as well). The flag-fopt-info-vec
is useful as it informs when a loop was successfully vectorised).This tells us that the function is vectorised. Here are the timings comparing both non-vectorised and vectorised versions on integer vectors of length 500 million:
This part can be safely skipped without losing continuity.
Compilers will not always have sufficient information to vectorise. We could use OpenMP specification for parallel programming, which also provides a simd compiler directive to instruct compilers to vectorise the code. It is essential to ensure that there are no memory overlaps, race conditions etc.. when vectorising code manually, else it'll result in incorrect results.
By doing this, we specifically ask the compiler to vectorise it no matter what. We'll need to activate OpenMP extensions by using compile time flag
-fopenmp
. By doing that:which is great! This was tested with gcc v6.2.0 and llvm clang v3.9.0 (both installed via homebrew, MacOS 10.12.3), both of which support OpenMP 4.0.
In this sense, even though Wikipedia page on Array Programming mentions that languages that operate on entire arrays usually call that as vectorised operations, it really is loop hiding IMO (unless it is actually vectorised).
In case of R, even
rowSums()
orcolSums()
code in C don't exploit code vectorisation IIUC; it is just a loop in C. Same goes forlapply()
. In case ofapply()
, it's in R. All of these are therefore loop hiding.HTH
References:
To me, vectorisation is primarily about making your code easier to write and easier to understand.
The goal of a vectorised function is to eliminate the book-keeping associated with a for loop. For example, instead of:
You can write:
That makes it easier to see what's the same (the input data) and what's different (the function you're applying).
A secondary advantage of vectorisation is that the for-loop is often written in C, rather than in R. This has substantial performance benefits, but I don't think it's the key property of vectorisation. Vectorisation is fundamentally about saving your brain, not saving the computer work.
First of all, in your example you make tests on a "data.frame" which is not fair for
colMeans
,apply
and"[.data.frame"
since they have an overhead:On a matrix, the picture is a bit different:
Regading the main part of the question, the main difference between
lapply
/mapply
/etc and straightforward R-loops is where the looping is done. As Roland notes, both C and R loops need to evaluate an R function in each iteration which is the most costly. The really fast C functions are those that do everything in C, so, I guess, this should be what "vectorised" is about?An example where we find the mean in each of a "list"s elements:
(EDIT May 11 '16 : I believe the example with finding the "mean" is not a good setup for the differences between evaluating an R function iteratively and compiled code, (1) because of the particularity of R's mean algorithm on "numeric"s over a simple
sum(x) / length(x)
and (2) it should make more sense to test on "list"s withlength(x) >> lengths(x)
. So, the "mean" example is moved to the end and replaced with another.)As a simple example we could consider the finding of the opposite of each
length == 1
element of a "list":In a
tmp.c
file:And in R side:
with data:
Benchmarking:
(Follows the original example of mean finding):