Based on what I've read, vectorization is a form of parallelism known as SIMD: it allows a processor to execute the same instruction (such as addition) on multiple elements of an array simultaneously.
However, I got confused when reading *The Relationship between Vectorized and Devectorized Code* regarding Julia's and R's vectorization performance. The post claims that devectorized Julia code (written as explicit loops) is faster than the vectorized code in both Julia and R, because:
> This confuses some people who are not familiar with the internals of R. It is therefore worth noting how one improves the speed of R code. The process of performance improvement is quite simple: one starts with devectorized R code, then replaces it with vectorized R code and then finally implements this vectorized R code in devectorized C code. This last step is unfortunately invisible to many R users, who therefore think of vectorization per se as a mechanism for increasing performance. Vectorization per se does not help make code faster. What makes vectorization in R effective is that it provides a mechanism for moving computations into C, where a hidden layer of devectorization can do its magic.
It claims that R turns vectorized code, written in R, into devectorized code in C. If vectorization is faster (as a form of parallelization), why would R devectorize the code and why is that a plus?
I think it is worth noting that the post you are referring to does not cover all of the current vectorization functionality in Julia.
The important thing is that vectorization in Julia is implemented in Julia, as opposed to R, where it is implemented outside of the language. This is explained in detail in this post: https://julialang.org/blog/2017/01/moredots.
The consequence is that Julia can fuse any sequence of broadcasted operations into a single loop. In other languages that provide vectorization, such fusion is possible only if it is explicitly implemented.
In summary:

- vectorization in Julia is implemented in the language itself, so any sequence of broadcasted operations can be fused into a single loop;
- vectorization in R is implemented outside the language, so fusing operations requires a dedicated C routine written for that specific combination.
EDIT:
Following the comment of 李哲源, here is an example showing that Julia is able to avoid any allocations when you want to increase all elements of a vector `x` by `1`: the expression `x .+= 1` performs the addition in place (adding `$` in front of the expression is only needed for benchmarking; in normal code it would be `x .+= 1`), and the benchmark shows that no memory allocation is done. If we compare this to a possible implementation in R, we can see that it not only saves memory, but also leads to a faster execution of the code.
"Vectorization" in R is vector processing from the R interpreter's point of view. Take the function `cumsum` as an example. On entry, the R interpreter sees that a vector `x` is passed into this function. However, the work is then handed off to C code that the R interpreter cannot analyze or track. While C is doing the work, R simply waits. By the time R's interpreter comes back, a whole vector has been processed. So in R's view, it has issued a single instruction but processed a vector. This is an analogy to the concept of SIMD: "single instruction, multiple data".

It is not only functions like `cumsum`, which take a vector and return a vector, that are seen as "vectorized" in R; a function like `sum`, which takes a vector and returns a scalar, is also a "vectorization".

Simply put: whenever R calls compiled code to run a loop, it is a "vectorization". This kind of "vectorization" is useful because a loop written in a compiled language is faster than a loop written in an interpreted language: the C loop is translated to machine code that a CPU can execute directly, whereas if a CPU wants to execute an R loop, it needs the R interpreter's help to read it, iteration by iteration. It is like this: if you know Chinese, you can respond directly to someone speaking Chinese to you; otherwise, a translator has to first translate each of their sentences into English, you respond in English, and the translator turns your answer back into Chinese, sentence by sentence. The effectiveness of communication is greatly reduced.
Be aware that "vectorization" in R is only an analogy to SIMD, not the real thing. Real SIMD uses the CPU's vector registers for computation and is therefore true parallel computing via data parallelism. R is not a language in which you can program CPU registers; you have to write compiled code or assembly code for that purpose.
R's "vectorization" does not care how a loop written in a compiled language is actually executed; after all, that is beyond the R interpreter's knowledge. Regarding whether such compiled code is executed with SIMD instructions, read *Does R leverage SIMD when doing vectorized calculations?*
More on "vectorization" in R
I am not a Julia user, but Bogumił Kamiński has demonstrated an impressive feature of that language: loop fusion. Julia can do this, because, as he points out, "vectorization in Julia is implemented in Julia", not outside the language.
This reveals a downside of R's vectorization: speed often comes at the price of memory usage. I am not saying that Julia won't have this problem (as I don't use it, I don't know), but it is definitely true for R.
Here is an example: Fastest way to compute row-wise dot products between two skinny tall matrices in R. `rowSums(A * B)` is a "vectorization" in R, as both `"*"` and `rowSums` are coded in C as loops. However, R cannot fuse them into a single C loop, so the temporary matrix `C = A * B` is generated and written to RAM.

Another example is R's recycling rule, or any computation relying on that rule. For example, when you add a scalar `a` to a matrix `A` with `A + a`, what really happens is that `a` is first replicated into a matrix `B` with the same dimensions as `A`, i.e., `B <- matrix(a, nrow(A), ncol(A))`, and then an addition between two matrices is computed: `A + B`. Clearly the generation of the temporary matrix `B` is undesirable, but, sorry, you can't do better unless you write your own C function for `A + a` and call it from R. This is what is meant by "such a fusion is possible only if explicitly implemented" in Bogumił Kamiński's answer.

To deal with the memory effects of many temporary results, R has a sophisticated mechanism called "garbage collection". It helps, but memory can still explode if you generate some really big temporary result somewhere in your code. A good example is the function `outer`. I have written many answers using this function, but it is particularly memory-unfriendly.

I might have gone off-topic in this edit, as I began to discuss the side effects of "vectorization". Use it with care.
For example, `c(crossprod(x, y))` is better than `sum(x * y)`, as it avoids materializing the temporary vector `x * y`.