I'm calculating a Vandermonde matrix for a fairly large 1D array. The natural and clean way to do this is with np.vander(). However, I found that it is approx. 2.5x slower than a list-comprehension-based approach.
In [43]: x = np.arange(5000)
In [44]: N = 4
In [45]: %timeit np.vander(x, N, increasing=True)
155 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# one of the listed approaches from the documentation
In [46]: %timeit np.flip(np.column_stack([x**(N-1-i) for i in range(N)]), axis=1)
65.3 µs ± 235 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [47]: np.all(np.vander(x, N, increasing=True) == np.flip(np.column_stack([x**(N-1-i) for i in range(N)]), axis=1))
Out[47]: True
I'm trying to understand where the bottleneck is and why the native np.vander() implementation is ~2.5x slower.
Efficiency matters for my implementation, so even faster alternatives are also welcome!
Here are some more methods, some of which are quite a bit faster (on my computer) than what has been posted so far.

The most important observation, I think, is that it really depends a lot on how many degrees you want. Exponentiation (which I believe is special-cased for small integer exponents) only makes sense for small exponent ranges; the larger the exponent range, the better multiplication-based approaches fare.
I'd like to highlight a multiply.accumulate based method (ma) which is similar to numpy's builtin approach but faster (and not because I skimped on checks; nnc, numpy-no-checks, demonstrates this). For all but the smallest exponent ranges it is actually the fastest for me.

For reasons I do not understand, the numpy implementation does three things that are, to the best of my knowledge, slow and unnecessary: (1) it makes quite a few copies of the base vector, (2) it makes them non-contiguous, and (3) it does the accumulation in-place, which I believe forces buffering.
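For context, the core of np.vander looks roughly like this (my paraphrase of the NumPy source, which may vary across versions); note the reversed view and the in-place accumulate:

```python
import numpy as np

def vander_core(x, N, increasing=False):
    # Paraphrase of np.vander's core logic (details may differ by version)
    x = np.asarray(x)
    v = np.empty((len(x), N), dtype=np.promote_types(x.dtype, int))
    tmp = v[:, ::-1] if not increasing else v  # reversed view: non-contiguous
    if N > 0:
        tmp[:, 0] = 1
    if N > 1:
        tmp[:, 1:] = x[:, None]  # broadcasts copies of the base vector into columns
        # in-place accumulation over a (possibly non-contiguous) view
        np.multiply.accumulate(tmp[:, 1:], out=tmp[:, 1:], axis=1)
    return v
```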
Another thing I'd like to mention is that the fastest method for small ranges of ints (out_e_1, essentially a manual version of ma) is slowed down by a factor of more than two by the simple precaution of promoting to a larger dtype (safe_e_1, arguably a bit of a misnomer).

The broadcasting based methods are called bc_*, where * indicates the broadcast axis (b for base, e for exp); 'cheat' means the result is non-contiguous.

Timings (best of 3):
Code:
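The full listing isn't reproduced here, but a minimal sketch of the ma idea, as I'd reconstruct it (not the exact benchmarked code), is to accumulate out of place into a freshly allocated contiguous buffer:

```python
import numpy as np

def ma(x, N, increasing=True):
    # Sketch of the multiply.accumulate approach (a reconstruction, not the
    # exact benchmarked code): accumulate into a contiguous buffer instead of
    # accumulating in-place over strided views.
    x = np.asarray(x)
    out = np.empty((N, len(x)), dtype=np.promote_types(x.dtype, int))
    out[0] = 1
    # rows 1..N-1 become x**1, x**2, ..., x**(N-1)
    np.multiply.accumulate(np.broadcast_to(x, (N - 1, len(x))), axis=0, out=out[1:])
    return out.T if increasing else out[::-1].T
```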
How about broadcasted exponentiation?
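Something along these lines, with x and N as in the question:

```python
import numpy as np

x = np.arange(5000)
N = 4

# Broadcast the base (column vector) against the exponents (row vector):
# V[i, j] = x[i] ** j, matching np.vander(x, N, increasing=True)
V = x[:, None] ** np.arange(N)
```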
Sanity check -
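One way to verify the result against np.vander (a sketch; the original check isn't shown):

```python
import numpy as np

x = np.arange(5000)
N = 4

# Broadcasted exponentiation should reproduce the increasing Vandermonde matrix
assert np.array_equal(x[:, None] ** np.arange(N),
                      np.vander(x, N, increasing=True))
```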
The caveat here is that this speedup is possible only if your input array x has a dtype of int. As @Warren Weckesser pointed out in a comment, the broadcasted exponentiation slows down for floating point arrays.

As for why np.vander is slow, take a look at the source code. The function has to cater to a lot more use cases besides yours, so it uses a more generalized method of computation which is reliable, but slower (I'm specifically pointing to multiply.accumulate).

As a matter of interest, I found another way of computing the Vandermonde matrix, ending up with this:
It does the same thing, but is so much slower. The answer lies in the fact that the operations are broadcast, but inefficiently.

On the flip side, for float arrays, it actually ends up performing the best.