I am attempting to find the fastest way to horizontally sum a numpy array of arrays using Cython. To get things started, let's assume I have a single 2D array of random floats, 10 x 100,000. I can create an object array with each column as a value as follows:
import numpy as np

n = 10 ** 5
a = np.random.rand(10, n)
a_obj = np.empty(n, dtype='O')
for i in range(n):
    a_obj[i] = a[:, i]
All I would like to do is find the sum of each row. Both sums can be computed trivially:
%timeit a.sum(1)
414 µs ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a_obj.sum()
113 ms ± 7.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The object array is 250x slower.
Before even trying to sum things up, I wanted to time access to each element. Cython is unable to speed up access to each member of the object array when iterating through each item directly:
from numpy cimport ndarray

def access_obj(ndarray[object] a):
    cdef int i
    cdef int nc = len(a)
    cdef int nr = len(a[0])
    for i in range(nc):
        for j in range(nr):
            a[i][j]
%timeit access_obj(a_obj)
42.1 ms ± 665 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
But, I believe I can access the underlying C arrays with the data attribute, which is a memoryview object, and get much faster access:
def access_obj_data(ndarray[object] a):
    cdef int i
    cdef int nc = len(a)
    cdef int nr = len(a[0])
    cdef double **data = <double**>a.data
    for i in range(nc):
        for j in range(nr):
            data[i][j]
I think this caches results when timing, so I had to run it only once:
%timeit -n 1 -r 1 access_obj_data(a_obj)
8.17 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
The problem is that you can't just cast a.data into a C array of doubles. If I print out the first column, I get the following:
5e-324
2.1821467023e-314
2.2428810855e-314
2.1219957915e-314
6.94809615162997e-310
6.94809615163037e-310
2.2772194067e-314
2.182150145e-314
2.1219964234e-314
0.0
There are similar questions, but none of them explicitly shows a fast way to do this operation with code that works.
I guess there are two different questions:
- Why is a_obj.sum() so slow?
- Why is the cython-function code slow?
We will see that the slowness of a_obj.sum() can be traced back to the overhead of Python dispatch plus cache misses. However, this is the result of a rather superficial analysis; there might be more subtle reasons.
The second question is less interesting: we will see that Cython doesn't know enough about the objects in the array to be able to speed the function up.
There is also the question "what is the fastest way?". The answer is "it depends on your data and your hardware", as it almost always is for this kind of question.
However, we will use the gained insights to improve slightly on the existing versions.
Let's start with "a_obj.sum() is slow".
As a baseline for your example on my machine (yep, a factor of about 200):
>>> %timeit a.sum(1)
510 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit a_obj.sum()
90.3 ms ± 494 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
First, what is happening in a_obj.sum()? The reduction algorithm goes like:
result = a_obj[0] + a_obj[1] + a_obj[2] + ...
However, the algorithm doesn't know that the a_obj[i] are numpy arrays, so it has to use Python dispatch to find out that a_obj[i] + a_obj[i+1] is actually an addition of two numpy arrays. This is quite an overhead for summing up just 10 doubles, and we have to pay it 10**5 times. I answered a question about the cost of Python dispatch some time ago, if you are interested in more details.
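To get a feeling for the size of this per-element overhead, one can time a single dispatched addition of two 10-element arrays and extrapolate. This is just a rough sanity check (variable names are made up for the illustration, and the numbers will differ from machine to machine):

import numpy as np
from timeit import timeit

x = np.random.rand(10)
y = np.random.rand(10)

# one dispatched addition of two small numpy arrays (incl. allocation of the result)
per_add = timeit(lambda: x + y, number=10**5) / 10**5

# a_obj.sum() has to perform about 10**5 such additions
print("one 10-element addition:         %.2f µs" % (per_add * 1e6))
print("extrapolated to 10**5 additions: %.1f ms" % (per_add * 1e5 * 1e3))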
If the shape of our array were 10**5 x 10, we would pay the overhead only 10 times, which would not be large compared with the work of 10**5 additions:
>>> b=np.random.rand(n,10)
>>> b_obj=to_obj_array(b) # see listing at the end for code
>>> %timeit b.sum(1)
2.43 ms ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit b_obj.sum()
5.53 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, the factor (about 2) is much smaller now. However, the Python-dispatch-overhead theory cannot explain everything.
Let's take a look at the following array sizes:
>>> c=np.random.rand(10**4,10**4)
>>> c_obj=to_obj_array(c)
>>> %timeit c.sum(1)
52.5 ms ± 354 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit c_obj.sum()
369 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This time the factor is about 7! The reason is the cache misses in the object version. This is a memory-bound task, so basically memory speed is the bottleneck.
In the numpy array the elements are stored contiguously in memory, one row after another. Every time a double is fetched from memory, it is fetched together with its 7 neighbors, and these 8 doubles are saved in the cache.
So when c.sum(1) runs, the information is needed row-wise: every time a double is fetched from memory into the cache, the next 7 numbers are prefetched along with it, so they can be summed up right away. This is the most efficient memory-access pattern.
The object version is different: its information is needed column-wise. So every time we fetch 7 additional doubles into the cache, we don't need them yet, and by the time we could process these doubles they have already been evicted from the cache (because the cache isn't big enough for all the numbers of a column) and must be fetched from the slow memory again.
That means with the second approach, every double will be read from memory about 8 times, i.e. there will be roughly 8 times more memory reads/cache misses.
We can easily verify this by using valgrind's cachegrind to measure the cache misses of both variants (see the listings at the end for test_cache.py):
valgrind --tool=cachegrind python test_cache.py array
...
D1 misses: 144,246,083
...
valgrind --tool=cachegrind python test_cache.py object
...
D1 misses: 1,285,664,578
...
i.e. the c_obj version has roughly 9 times more D1 cache misses.
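The same access-pattern effect can also be glimpsed directly from the Python side, assuming c from above is still defined: summing one contiguous row of the C-ordered c uses every fetched cache line fully, while summing one column needs a whole cache line per element. No numbers are quoted here since they depend on the machine, but the column sum should be noticeably slower:

>>> %timeit c[0, :].sum()   # contiguous: 8 useful doubles per fetched cache line
>>> %timeit c[:, 0].sum()   # strided by a full row: 1 useful double per cache line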
But there is more fun: the object version can even be faster!
>>> d=np.random.rand(800,6)
>>> d_obj=to_obj_array(d)
>>> %timeit d.sum(1)
20.2 µs ± 217 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit d_obj.sum()
12.4 µs ± 155 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
How could that be? One could expect them to be similarly fast, because
- the whole array can be kept in the cache, so there are no additional cache misses, and
- the Python-dispatch overhead is bearable.
But two times faster for the otherwise slower version?
I'm not 100% sure, but here is my theory: for the array version, we calculate the sum of row i:
...
sum_i=sum_i+a[i][j]
sum_i=sum_i+a[i][j+1]
sum_i=sum_i+a[i][j+2]
...
so the hardware cannot execute the instructions in parallel, because the result of the previous operation is needed to calculate the next one.
However for the object-version, the calculation looks as follows:
...
sum_i=sum_i+a[i][j]
sum_k=sum_k+a[k][j]
sum_m=sum_m+a[m][j]
...
These steps can be done in parallel by the CPU because there is no dependency between them.
So, why is Cython so slow? Because Cython doesn't know anything about the objects in the array, it is not really faster than the usual Python code, which is very slow!
Let's break down the Cython function access_obj:
- Cython knows nothing about a[i] and assumes it is a generic Python object.
- That means, for a[i][j] it must call __getitem__ of a[i] with the help of the Python infrastructure. Btw., for that, j must be converted to a full-fledged Python integer.
- __getitem__ constructs a full-fledged Python float from the stored C double and returns it to Cython.
- The returned Python float is cast back to a C float.
The bottom line: we have used Python dispatch and created two temporary Python objects for every single element access; you can see how the costs add up.
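As a side note, this also shows that the missing type information is the whole problem: if Cython is told that each element of the object array is itself a 1D double array, the inner loop no longer needs any Python objects. Here is a rough sketch (the name access_obj_typed and the typed-memoryview approach are just an illustration, not the route taken below):

from numpy cimport ndarray

def access_obj_typed(ndarray[object] a):
    cdef int i, j
    cdef int nc = len(a)
    cdef int nr = len(a[0])
    cdef double[:] row
    cdef double tmp
    for i in range(nc):
        row = a[i]            # one Python interaction per element of the object array
        for j in range(nr):
            tmp = row[j]      # plain C access, no temporary Python objects created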
Speeding up the summation
We have seen that the summation is a memory-bound task (at least on my machine), so the most important thing is to avoid cache misses (which could also mean, for example, that we need different algorithms for C-order and F-order numpy arrays).
However, if our task has to share the CPU with a CPU-heavy task, it might again become de facto CPU-bound, so it has some merit to focus on more than just the cache misses.
In the C version, which will be introduced later on, we would like to combine the good properties of both approaches:
- Similar to the object version, we would like to calculate several row sums in parallel, in order to use the intrinsic parallelism of the CPU.
- Similar to the numpy-array sum, we would like to have a minimal number of cache misses, so we calculate not all rows simultaneously but only a small block (e.g. 8) at a time. This block-wise calculation has the advantage that the data, once cached, doesn't get evicted from the cache before it is really needed.
Here is a simplified version, which processes rows in bunches of 8:
//fast.c
#define MAX 8

void sum_rows_blocks(double *orig, double *out, int n_rows, int n_cols){
   int n_blocks=n_rows/MAX;
   for (int b=0; b<n_blocks; b++){   //for every block
       double res[MAX]={0.0};        //initialize to 0.0
       for(int j=0;j<n_cols;j++){
           for(int i=0;i<MAX;i++){   //calculate sums for MAX rows simultaneously
               res[i]+=orig[(b*MAX+i)*n_cols+j];
           }
       }
       for(int i=0;i<MAX;i++){
           out[b*MAX+i]=res[i];
       }
   }
   //left-overs:
   int left=n_rows-n_blocks*MAX;
   double res[MAX]={0.0};            //initialize to 0.0
   for(int j=0;j<n_cols;j++){
       for(int i=0;i<left;i++){      //calculate sums for the left-over rows simultaneously
           res[i]+=orig[(n_blocks*MAX+i)*n_cols+j];
       }
   }
   for(int i=0;i<left;i++){
       out[n_blocks*MAX+i]=res[i];
   }
}
Now we can use cython to wrap this function:
#c_sum.pyx
cimport numpy as np
import numpy as np

cdef extern from "fast.c":
    void sum_rows_blocks(double *orig, double *out, int n_rows, int n_cols)

def sum_rows_using_blocks(double[:,:] arr):
    cdef int n_rows
    cdef int n_cols
    cdef np.ndarray[np.double_t] res

    n_rows=arr.shape[0]
    n_cols=arr.shape[1]
    res=np.empty((n_rows,),'d')
    sum_rows_blocks(&arr[0,0], &res[0], n_rows, n_cols)
    return res
And after building the extension (see the listing for setup.py), we can do the comparison:
import c_sum
%timeit c_sum.sum_rows_using_blocks(d) #shape=(800,6)
4.26 µs ± 95.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
It is a factor of 3 faster than the fastest version so far (12 µs). How does it fare for large arrays?
%timeit c_sum.sum_rows_using_blocks(c) #shape=(10**4,10**4)
49.7 ms ± 600 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Good, it is slightly faster than c.sum(1), which takes 52 ms; that means we don't have additional cache misses.
Can it be improved? I do think so: for example, not using the default compile flag -fwrapv (by including extra_compile_args = ["-fno-wrapv"] in the setup) will improve the resulting assembly and lead to a 10% speedup for the summation for the shape (800, 6).
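Concretely, that means extending the Extension in the setup.py from the listings below (shown here only as a sketch; whether the flag actually pays off depends on your compiler and machine):

#setup.py (variant with -fno-wrapv)
from distutils.core import setup, Extension
from Cython.Build import cythonize
import numpy as np

setup(ext_modules=cythonize(Extension(
           name='c_sum',
           sources = ["c_sum.pyx"],
           include_dirs=[np.get_include()],
           extra_compile_args = ["-fno-wrapv"],   #override the default -fwrapv
    )))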
Listings:
def to_obj_array(a):
    n=a.shape[1]
    obj=np.empty((n,),dtype='O')
    for i in range(n):
        obj[i]=a[:,i]
    return obj
test_cache.py:
import sys
import numpy as np

def to_obj_array(a):
    n=a.shape[1]
    obj=np.empty((n,),dtype='O')
    for i in range(n):
        obj[i]=a[:,i]
    return obj

c=np.random.rand(10**4,10**4)
c_obj=to_obj_array(c)

for i in range(10):
    if sys.argv[1]=="array":
        c.sum(1)
    elif sys.argv[1]=="object":
        c_obj.sum()
    else:
        raise Exception("unknown parameter")
setup.py:
#setup.py
from distutils.core import setup, Extension
from Cython.Build import cythonize
import numpy as np

setup(ext_modules=cythonize(Extension(
           name='c_sum',
           sources = ["c_sum.pyx"],
           #extra_link_args = ["fast.o"],
           include_dirs=[np.get_include()]
    )))