Why is devectorization encouraged in Julia?

Posted 2019-02-06 05:21

Question:

It seems that writing devectorized code is encouraged in Julia. There is even a package that tries to do it for you.

My question is why?

First of all, speaking from the user-experience perspective, vectorized code is more concise (less code, and hence less likelihood of bugs), clearer (and hence easier to debug), and a more natural way of writing code (at least for someone who comes from a scientific computing background, the audience Julia tries to cater to). Being able to write something like vector'vector or vector'Matrix*vector is very important, because it corresponds to the actual mathematical notation, which is how scientific computing people think of it in their heads (not as nested loops). And I hate the fact that this is not the best way to write it, and that recasting it as loops would be faster.

At the moment it seems like there is a conflict between the goal of writing code that is fast and writing code that is concise and clear.

Secondly, what is the technical reason for this? OK, I understand that vectorized code creates extra temporaries, etc., but vectorized functions (for example, broadcast(), map(), etc.) have the potential to be multithreaded, and I think the benefit of multithreading could outweigh the overhead of temporaries and the other disadvantages of vectorized functions, making them faster than regular for loops.
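For concreteness, these are the kinds of vectorized calls I mean (a trivial sketch; the function being applied and the values are just placeholders):

x = randn(1000)
y1 = map(z -> 2z + 1, x)        # apply a function elementwise over x
y2 = broadcast(z -> 2z + 1, x)  # same thing via broadcast
y3 = 2 .* x .+ 1                # dot syntax, which also lowers to broadcast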

Do current implementations of vectorized functions in Julia do implicit multithreading under the hood?

If not, is there work underway, or are there plans, to add implicit concurrency to vectorized functions and to make them faster than loops?

Answer 1:

For easy reading I decided to turn my comment marathon above into an answer.

The core development statement behind Julia is "we are greedy". The core devs want it to do everything, and do it fast. In particular, note that the language is supposed to solve the "two-language problem", and at this stage, it looks like it will accomplish this by the time v1.0 hits.

In the context of your question, this means that everything you are asking about is either already a part of Julia, or planned for v1.0.

In particular, this means that if your programming problem lends itself to vectorized code, then write vectorized code. If it is more natural to use loops, use loops.

By the time v1.0 hits, most vectorized code should be as fast as, or faster than, equivalent code in MATLAB. In many cases this development goal has already been achieved, since many vector/matrix operations in Julia are dispatched to the appropriate BLAS routines.

Regarding multi-threading, native multi-threading is currently being implemented for Julia, and I believe an experimental set of routines is already available on the master branch. The relevant issue page is here. Implicit multithreading for some vector/matrix operations is already in theory available in Julia, since Julia calls BLAS. I'm not sure if it is switched on by default.
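To illustrate the BLAS point (a minimal sketch, assuming a Julia version where BLAS is accessed through the LinearAlgebra standard library; the thread count and matrix size are arbitrary): the number of threads used by the underlying BLAS library can be set explicitly, and BLAS-backed operations such as matrix multiplication will then run multi-threaded.

using LinearAlgebra         # BLAS lives in the LinearAlgebra standard library
BLAS.set_num_threads(4)     # ask the BLAS library to use 4 threads
A = randn(2000, 2000)
B = randn(2000, 2000)
C = A * B                   # dispatched to BLAS gemm, which may run multi-threaded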

Be aware, though, that many vectorized operations will still (currently) be much faster in MATLAB, since MathWorks has been writing specialised multi-threaded C libraries for years and calling them under the hood. Once Julia has native multi-threading, I expect Julia to overtake MATLAB, since at that point the entire dev community will be able to scour the standard Julia packages and upgrade them to take advantage of native multi-threading wherever possible.

In contrast, MATLAB does not give you native multi-threading in the language itself, so you are relying on MathWorks to provide specialised multi-threaded routines in the form of underlying C libraries.



Answer 2:

You can and should write vector'*matrix*vector (or perhaps dot(vector, matrix*vector) if you prefer a scalar output). For things like matrix multiplication, you're much better off using vectorized notation, because it calls the underlying BLAS libraries which are more heavily optimized than code produced by basically any language/compiler combination.
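Concretely (a small sketch; the variable names and sizes are arbitrary), both spellings hit the same BLAS-backed matrix-vector product:

using LinearAlgebra
M = randn(500, 500)
v = randn(500)
s1 = v' * M * v        # quadratic form; M * v goes through BLAS
s2 = dot(v, M * v)     # same value, written with an explicit dot product
s1 ≈ s2                # true (in recent Julia versions v' * M * v is itself a scalar)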

In other places, as you say, you can benefit from devectorization by avoiding temporary intermediates: for example, if x is a vector, the expression

y = exp(x).*x + 5

creates 3 temporary vectors: one for a = exp(x), one for b = a.*x and one for y = b + 5. In contrast,

y = [exp(z)*z+5 for z in x]

creates no temporary intermediates. Since loops and comprehensions in Julia are fast, there is no disadvantage to writing the devectorized version, and in fact it should perform slightly better (especially with performance annotations like @simd, where appropriate).
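For illustration, here is how the devectorized version might be written as a function with the annotation mentioned above (a sketch; the function name f! and the preallocated output are just one way to do it):

function f!(y, x)
    @inbounds @simd for i in eachindex(x)   # @simd is a hint; @inbounds skips bounds checks
        y[i] = exp(x[i]) * x[i] + 5
    end
    return y
end
y = f!(similar(x), x)    # preallocate the output once and fill it in place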

The arrival of threads may change things (making a vectorized exp faster than a "naive" exp), but in general I'd say you should regard this as an "orthogonal" issue: Julia will likely make multithreading so easy to use that you might write operations using multiple threads yourself, in which case the vectorized "library" routine still has no advantage over code you write. In other words, you might use multiple threads but still write devectorized code to avoid those temporaries.
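As a sketch of that "multiple threads, but still devectorized" style using Julia's native threading macro (the function name is illustrative; Julia must be started with more than one thread, e.g. via the JULIA_NUM_THREADS environment variable):

function threaded_f!(y, x)
    Threads.@threads for i in eachindex(x)   # split the iteration range across threads
        @inbounds y[i] = exp(x[i]) * x[i] + 5
    end
    return y
end
y = threaded_f!(similar(x), x)    # still only one output array, no extra temporaries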

In the longer term, a "sufficiently smart compiler" may avoid temporaries by automatically devectorizing some of these operations, but this is a much harder route than it may seem, with potential traps for the unwary.

Your statement that "vectorized code is always more concise, and easier to understand" is, however, not true: many times, when writing MATLAB code, you have to go to extremes to come up with a vectorized way of writing what are actually simple operations when thought of in terms of loops. You can search the mailing lists for countless examples; one that I recall from SO is How to find connected components in a matrix using Julia.