I am attempting to find the fastest way to horizontally sum a numpy array of arrays using Cython. To get things started, let's assume I have a single 10 x 100,000 2D array of random floats. I can create an object array with each column of the 2D array as a value in the object array as follows:
    import numpy as np

    n = 10 ** 5
    a = np.random.rand(10, n)
    a_obj = np.empty(n, dtype='O')
    for i in range(n):
        a_obj[i] = a[:, i]
All I would like to do is find the sum of each row. Both sums can be computed trivially:

    %timeit a.sum(1)
    414 µs ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    %timeit a_obj.sum()
    113 ms ± 7.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The object array is 250x slower.
Before even trying to sum things up, I wanted to time access to each element. Cython is unable to speed up access to each member of the object array when iterating through each item directly:
    from numpy cimport ndarray

    def access_obj(ndarray[object] a):
        cdef int i, j
        cdef int nc = len(a)
        cdef int nr = len(a[0])
        for i in range(nc):
            for j in range(nr):
                a[i][j]   # touch every element, do nothing with it
    %timeit access_obj(a_obj)
    42.1 ms ± 665 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
But I believe I can access the underlying C arrays with the data attribute (which is a memoryview object) and get much faster access:
    from numpy cimport ndarray

    def access_obj_data(ndarray[object] a):
        cdef int i, j
        cdef int nc = len(a)
        cdef int nr = len(a[0])
        # reinterpret the raw buffer as an array of double-pointers
        cdef double **data = <double**>a.data
        for i in range(nc):
            for j in range(nr):
                data[i][j]
I think this gets cached when timing, so I had to run it just once:
    %timeit -n 1 -r 1 access_obj_data(a_obj)
    8.17 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
The problem is that you can't just cast a.data into a C array of doubles. If I print out the first column, I get the following:
    5e-324
    2.1821467023e-314
    2.2428810855e-314
    2.1219957915e-314
    6.94809615162997e-310
    6.94809615163037e-310
    2.2772194067e-314
    2.182150145e-314
    2.1219964234e-314
    0.0
There are similar questions, but nothing explicitly answers a way to do this operation fast with code that works.
I guess there are two different questions:

1. Why is a_obj.sum() so slow?
2. Why is the cython code slow?

We will see that the slowness of a_obj.sum() can be tracked down to the overhead of Python-dispatch plus cache misses. However, this is the result of a quite superficial analysis; there might be more subtle reasons.

The second question is less interesting: we will see that Cython doesn't know enough about the objects in the array to be able to speed the function up.
There is also the question "what is the fastest way?". The answer is "it depends on your data and your hardware", as is almost always the case for this kind of question.
However, we will use the gained insights to improve slightly on the existing versions.
Let's start with "a_obj.sum() is slow". As a baseline, your example gives similar numbers on my machine (yep, a factor of about 200).

First, what is happening in a_obj.sum()? The reduction algorithm is essentially a sequential reduction.
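A Python sketch of it (numpy's actual reduction machinery is more involved):

    # sketch: how the reduction over the object array proceeds
    result = a_obj[0]
    for i in range(1, len(a_obj)):
        result = result + a_obj[i]   # Python-dispatch decides what "+" means each time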
However, the algorithm doesn't know that the a_obj[i] are numpy-arrays, so it has to use Python-dispatch to find out that a_obj[i] + a_obj[i+1] is actually an addition of two numpy-arrays. This is quite an overhead for summing up 10 doubles, and we have to pay it 10**5 times. I answered a question about the cost of Python-dispatch some time ago, if you are interested in more details.

If the shape of our array were 10**5 x 10, we would pay the overhead only 10 times; it would not be large compared with the work of 10**5 additions:
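As a sketch (the names b and b_obj are mine, and I omit the timings):

    # shape (10**5, 10): the object array now holds only 10 long columns
    b = np.random.rand(n, 10)
    b_obj = np.empty(10, dtype='O')
    for i in range(10):
        b_obj[i] = b[:, i]

    # b_obj.sum() now performs only 10 array-additions (each of length 10**5),
    # so the dispatch overhead is paid only 10 times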
As you can see, the factor (about 2) is much smaller now. However, the Python-dispatch-overhead theory cannot explain everything.
Let's take a look at bigger arrays:
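A sketch of such a setup (the concrete shape here is my assumption, chosen so that the cache lines touched by one column far exceed a typical 32 KB L1 cache; the names c and c_obj are used in the discussion below):

    # every element of a column view c[:, i] lives on its own cache line,
    # so traversing one column touches 10**4 lines (~640 KB)
    c = np.random.rand(10**4, 10**4)
    c_obj = np.empty(10**4, dtype='O')
    for i in range(10**4):
        c_obj[i] = c[:, i]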
This time the factor is about 8! The reason is the cache-misses in the obj-version. This is a memory-bound task, so basically the memory-speed is the bottleneck.

In the numpy-array the elements are stored contiguously in memory, one row after another. Every time a double is fetched from memory, it is fetched together with 7 double-neighbors, and these 8 doubles are saved in the cache.
So when c.sum(1) runs, the information is needed row-wise: every time a double is fetched from memory into the cache, the next 7 numbers are prefetched with it, so they can be summed up right away. This is the most efficient memory-access pattern.

The object-version is different: its information is needed column-wise. So every time we fetch 7 additional doubles into the cache, we don't need them yet, and by the time we could process them they have already been evicted from the cache (because the cache isn't big enough for all numbers from a column) and must be fetched from the slow memory again. That means with the second approach every double is read 8 times from memory, i.e. there are 8 times more reads from memory/cache-misses.
We can easily verify this by using valgrind's cachegrind to measure the cache-misses for both variants (see the listing for check_cache.py at the end):
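A sketch of the invocation (the command-line argument for selecting the variant is an assumption; see the check_cache.py sketch below):

    valgrind --tool=cachegrind python check_cache.py array
    valgrind --tool=cachegrind python check_cache.py object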
i.e. the c_obj-version has about 8 times more D1-cache-misses.

But there is more fun: the object-version can end up being faster! How could that be? Similar speed would be plausible, since the memory bandwidth is the bottleneck either way; but two times faster for the otherwise slower version?
I'm not 100% sure, but here is my theory: for the array-version, we calculate the sum of row i with a single running accumulator.
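A sketch of that dependency chain:

    # a single dependency chain: each addition must wait for the previous one
    i = 0   # some fixed row
    acc = 0.0
    for j in range(a.shape[1]):
        acc += a[i, j]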
So the hardware cannot execute the additions in parallel, because the result of the previous operation is needed to calculate the next one.
However, for the object-version the calculation looks different.
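Again a sketch, with ten independent partial sums, one per row:

    # ten independent dependency chains: the 10 component additions of each
    # step do not depend on one another and can overlap in the CPU pipeline
    result = np.zeros(10)
    for j in range(n):
        result = result + a_obj[j]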
These steps can be done in parallel by the CPU, because there is no dependency between them.

So, why is cython so slow? Because cython doesn't know anything about the objects in the array, it is not really faster than the usual Python code, which is very slow!
Let's break down the cython-function access_obj:

1. Cython knows nothing about a[i] and assumes it is a generic Python-object.
2. Thus, for a[i][j] it must call the __getitem__ of a[i] with the help of the Python-infrastructure. Btw., for that, j must be converted to a full-fledged Python-integer.
3. __getitem__ then constructs a full-fledged Python-float from the stored c-double and returns it to cython.

The bottom line: we have used Python-dispatch and created two temporary Python-objects for every single element access - you can see that the costs add up.
Speeding up the summation
We have seen that the summation is a memory-bound task (at least on my machine), so the most important thing is to avoid cache-misses (this could also mean, for example, that we need different algorithms for C-order and F-order numpy-arrays).

However, if our task has to share the CPU with a CPU-heavy task, it might again become de facto CPU-bound, so it has some merit to focus on more than only the cache-misses.
In the C-version, which is introduced below, we would like to combine the good properties of both approaches: contiguous, row-major memory access as in c.sum(1), and multiple independent accumulators as in the object-version.
Here is a simplified version, which processes rows in bunches of 8:
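A sketch of such a function (file and function names are my choice):

    /* fast.c - a sketch: sum each row of a C-ordered (n_rows x n_cols) array.
       Processing 8 rows at a time keeps 8 independent accumulators live,
       while only 8 cache lines per block are in flight, so they stay cached. */
    #define BLOCK 8

    void sum_rows_blocks(const double *in, double *out, int n_rows, int n_cols)
    {
        for (int start = 0; start < n_rows; start += BLOCK) {
            int stop = start + BLOCK < n_rows ? start + BLOCK : n_rows;
            double acc[BLOCK] = {0.0};              /* partial sums of the block */
            for (int j = 0; j < n_cols; j++)
                for (int i = start; i < stop; i++)  /* independent additions */
                    acc[i - start] += in[(long)i * n_cols + j];
            for (int i = start; i < stop; i++)
                out[i] = acc[i - start];
        }
    }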
Now we can use cython to wrap this function:
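A sketch of the wrapper (a .pyx file; module and function names continue the assumptions above):

    # sum_rows.pyx - a sketch of the cython wrapper
    cimport numpy as np
    import numpy as np

    cdef extern from "fast.c":
        void sum_rows_blocks(const double *in_, double *out,
                             int n_rows, int n_cols)

    def sum_rows_c(np.ndarray[np.float64_t, ndim=2, mode="c"] a not None):
        cdef np.ndarray[np.float64_t, ndim=1] out = np.empty(a.shape[0],
                                                             dtype=np.float64)
        sum_rows_blocks(&a[0, 0], &out[0], <int>a.shape[0], <int>a.shape[1])
        return out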
And after building the extension (see the listing for setup.py at the end), we can do the comparison:
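For example (a sketch, using the names assumed above):

    python setup.py build_ext --inplace

and then, in IPython:

    from sum_rows import sum_rows_c
    %timeit sum_rows_c(a)   # vs. %timeit a.sum(1) from the beginning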
It is a factor of 3 faster than the fastest version so far (12 µs). How does it fare for large arrays? Good, it is slightly faster than c.sum(1), which takes 52 ms, so that means there are no additional cache misses.

Can it be improved? I do think so: for example, not using the default compile flag -fwrapv (by including extra_compile_args = ["-fno-wrapv"] in the setup) will improve the resulting assembly and lead to a 10% speedup of the summation for the shape (800, 6).
Listings:
check_cache.py:
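A minimal sketch of such a script (the command-line interface and the shape are assumptions):

    # check_cache.py - a sketch: run one variant once, so that cachegrind's
    # counts are dominated by the summation of interest
    import sys
    import numpy as np

    c = np.random.rand(10**4, 10**4)

    if sys.argv[1] == "array":
        c.sum(1)                           # contiguous, row-wise access
    else:
        c_obj = np.empty(10**4, dtype='O')
        for i in range(10**4):
            c_obj[i] = c[:, i]             # strided column views
        c_obj.sum()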
setup.py:
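A sketch of a matching setup.py (module names as assumed above; fast.c is textually included by the .pyx, so only the .pyx is listed as a source):

    # setup.py - a sketch for building the extension
    from setuptools import setup, Extension
    from Cython.Build import cythonize
    import numpy as np

    extensions = [
        Extension(
            "sum_rows",
            sources=["sum_rows.pyx"],
            include_dirs=[np.get_include()],
            # extra_compile_args=["-fno-wrapv"],  # optional, see above
        )
    ]

    setup(ext_modules=cythonize(extensions))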