This question is a follow-up to my answer in Efficient way to compute the Vandermonde matrix.
Here's the setup:
import numpy as np

x = np.arange(5000) # an integer array
N = 4
Now, I'll compute the Vandermonde matrix in two different ways:
m1 = (x ** np.arange(N)[:, None]).T
And,
m2 = x[:, None] ** np.arange(N)
Sanity check:
np.array_equal(m1, m2)
True
These two methods produce identical results, but their performance is not the same:
%timeit m1 = (x ** np.arange(N)[:, None]).T
42.7 µs ± 271 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit m2 = x[:, None] ** np.arange(N)
150 µs ± 995 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So, the first method, despite requiring a transposition at the end, is still over 3X faster than the second method.
The only difference is that in the first case it is the smaller array that is broadcast, whereas in the second case it is the larger one.
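One way to make this concrete is to look at the broadcast views and their strides (the stride values in the comments assume 8-byte integers, so they may differ on other platforms):

xb1, yb1 = np.broadcast_arrays(x, np.arange(N)[:, None])
xb2, yb2 = np.broadcast_arrays(x[:, None], np.arange(N))
xb1.shape, xb1.strides  # ((4, 5000), (0, 8)) -- x runs contiguously along the long inner axis
xb2.shape, xb2.strides  # ((5000, 4), (8, 0)) -- each value of x is repeated 4 times along the short inner axis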
So, with a fairly decent understanding of how numpy works, I can guess that the answer would involve the cache. The first method is a lot more cache friendly than the second. However, I'd like an official word from someone with more experience than me.
What could be the reason for this stark contrast in timings?
I'm not convinced caching is really the single most influential factor here.
I'm also not a trained computer scientist, so I may well be wrong, but let me walk you through a couple of observations. For simplicity I'm relying on @hpaulj's observation that '+' shows essentially the same effect as '**'.

My working hypothesis is that it is the overhead of the outer loops, which I believe to be substantially more expensive than the contiguous, vectorizable innermost loops.
So let us first minimize the amount of data that repeats, so that caching is unlikely to have much impact:
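A sketch of the kind of comparison I mean (my arrangement, with made-up names; in both cases only a 4-element array is broadcast, the large operand is contiguous, and the total number of additions is identical):

big_row = np.random.rand(4, 5000)   # contiguous, long last axis
big_col = np.random.rand(5000, 4)   # contiguous, short last axis
small = np.random.rand(4)

%timeit big_row + small[:, None]    # 4 outer iterations over inner loops of length 5000
%timeit big_col + small             # 5000 outer iterations over inner loops of length 4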
Here both scenarios have contiguous data and the same number of additions, but the version with 5000 outer iterations is substantially slower. When we bring back caching (albeit across trials), the absolute difference stays roughly the same, but the ratio becomes even more pronounced:
Returning to the original "outer sum" scenario, we see that the non-contiguous long-dimension case gets even worse. Since we have to read no more data than in the contiguous scenario, this cannot be explained by data not being cached.
Further, both profit from across-trial caching:
From a cachist's point of view this is inconclusive at best.
So let's have a look at the source. After building a current NumPy from the tarball, you'll find, somewhere in the tree, almost 15000 lines worth of computer-generated code in a file called 'loops.c'. These loops are the innermost loops of ufuncs; the most relevant bit for our situation appears to be:
The payload in our case seems lean enough, especially if I interpret the comment about contiguous specialization and autovectorization correctly. Now, if we do only 4 iterations, the overhead-to-payload ratio starts to look a bit troubling, and it doesn't end here.
In the file ufunc_object.c we find the following snippet
The actual loop looks like this:
innerloop is the inner loop we looked at above. How much overhead comes with iternext?
For this we need to turn to the file nditer_templ.c.src, where we find:
This function returns a function pointer to one of the things the preprocessing makes of
Parsing this is beyond me, but in any case it is a function pointer that must be called at every iteration of the outer loop, and as far as I know function pointers cannot be inlined, so compared to 4 iterations of a trivial loop body this will be substantial.
I should probably profile this but my skills are insufficient.
While I'm afraid my conclusion won't be more fundamental than yours ("probably caching"), I believe I can help focus our attention with a set of more localized tests.
Consider your example problem:
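A sketch of such a setup (my reconstruction; the names match the description below, but the original construction may have differed in detail):

import numpy as np

x = np.arange(5000)
N = 4

# the two original operand pairs
x1, y1 = x, np.arange(N)[:, None]        # result shape (4, 5000)
x2, y2 = x[:, None], np.arange(N)        # result shape (5000, 4)

# explicitly broadcast versions: views sharing the data, with 0-strides
x1_bc, y1_bc = np.broadcast_arrays(x1, y1)
x2_bc, y2_bc = np.broadcast_arrays(x2, y2)

# contiguous copies of the broadcast arrays, as if built with np.tile
x1_cont, y1_cont = np.ascontiguousarray(x1_bc), np.ascontiguousarray(y1_bc)
x2_cont, y2_cont = np.ascontiguousarray(x2_bc), np.ascontiguousarray(y2_bc)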
As you can see, I defined a bunch of arrays to compare. x1, y1 and x2, y2, respectively, correspond to your original test cases. ??_bc correspond to explicitly broadcast versions of these arrays. These share the data with the original ones, but they have explicit 0-strides in order to get the appropriate shape. Finally, ??_cont are contiguous versions of these broadcast arrays, as if constructed with np.tile.

So x1_bc, y1_bc, x1_cont and y1_cont all have shape (4, 5000), but while the former two have zero strides, the latter two are contiguous arrays. For all intents and purposes, taking the power of any of these corresponding pairs of arrays should give us the same contiguous result (as hpaulj noted in a comment, a transposition itself is essentially for free, so I'm going to ignore that outermost transpose in what follows).

Here are the timings corresponding to your original check:
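That is, comparing:

%timeit x1 ** y1    # the (4, 5000) orientation
%timeit x2 ** y2    # the (5000, 4) orientation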
Here are the timings for the explicitly broadcast arrays:
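In other words:

%timeit x1_bc ** y1_bc
%timeit x2_bc ** y2_bc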
Same thing. This tells me that the discrepancy isn't somehow due to the transition from the indexed expressions to the broadcast arrays. This was mostly expected, but it never hurts to check.
Finally, the contiguous arrays:
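That is:

%timeit x1_cont ** y1_cont
%timeit x2_cont ** y2_cont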
A huge part of the discrepancy goes away!
So why did I check this? There is a general rule of thumb that you can work with CPU caching if you use vectorized operations along large trailing dimensions in python. To be more specific, for row-major ("C-order") arrays trailing dimensions are contiguous, while for column-major ("fortran-order") arrays the leading dimensions are contiguous. For large enough dimensions, arr.sum(axis=-1) should be faster than arr.sum(axis=0) for row-major numpy arrays, give or take some fine print.

What happens here is that there is a huge difference between the two dimensions (size 4 and 5000, respectively), but the huge performance asymmetry between the two transposed cases only happens for the broadcasting case. My admittedly handwaving impression is that broadcasting uses 0-strides to construct views of appropriate size. These 0-strides imply that in the faster case memory access looks like this for the long x array:
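Schematically, with each line standing for one pass of the inner loop:

mem1, mem2, mem3, ..., mem4999, mem5000
mem1, mem2, mem3, ..., mem4999, mem5000
mem1, mem2, mem3, ..., mem4999, mem5000
mem1, mem2, mem3, ..., mem4999, mem5000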
where mem* just denotes a float64 value of x sitting somewhere in RAM. Compare this to the slower case, where we're working with shape (5000, 4):
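Schematically again, now with 5000 short passes:

mem1, mem1, mem1, mem1
mem2, mem2, mem2, mem2
mem3, mem3, mem3, mem3
...
mem5000, mem5000, mem5000, mem5000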
My naive notion is that working with the former allows the CPU to cache larger chunks of the individual values of x at a time, so performance is great. In the latter case the 0-strides make the CPU hop around on the same memory address of x four times each, doing this 5000 times. I find it reasonable to believe that this setup works against caching, leading to overall bad performance. This would also be in agreement with the fact that the contiguous cases don't show this performance hit: there the CPU has to work with all 5000*4 unique float64 values, and caching might not be impeded by these weird reads.

I too tried to look at broadcast_arrays. np.ascontiguousarray converts the 0-strided dimensions into full ones, and operating with the full arrays is faster:
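A sketch of that experiment (xb, yb, xc, yc are my names for the broadcast views and their contiguous copies):

xb, yb = np.broadcast_arrays(x[:, None], np.arange(N))        # 0-strided views, shape (5000, 4)
xc, yc = np.ascontiguousarray(xb), np.ascontiguousarray(yb)   # fully expanded copies

%timeit xb ** yb    # operands with 0-strides
%timeit xc ** yc    # contiguous operands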
So there's some sort of time penalty to using the 0-strided arrays, even if nothing explicitly expands the arrays first. And the cost is tied to the shapes, and possibly to which dimension is expanded.