Why is cffi so much quicker than numpy?

I have been playing around with writing cffi modules in python, and their speed is making me wonder if I'm using standard python correctly. It's making me want to switch to C completely! Truthfully there are some great python libraries I could never reimplement myself in C so this is more hypothetical than anything really.

This example shows the sum function in python being used with a numpy array, and how slow it is in comparison with a c function. Is there a quicker pythonic way of computing the sum of a numpy array?

def cast_matrix(matrix, ffi):
    ap = ffi.new("double* [%d]" % (matrix.shape[0]))
    ptr = ffi.cast("double *", matrix.ctypes.data)
    for i in range(matrix.shape[0]):
        ap[i] = ptr + i*matrix.shape[1]                                                                
    return ap 

ffi = FFI()
ffi.cdef("""
double sum(double**, int, int);
""")
C = ffi.verify("""
double sum(double** matrix,int x, int y){
    int i, j; 
    double sum = 0.0;
    for (i=0; i<x; i++){
        for (j=0; j<y; j++){
            sum = sum + matrix[i][j];
        }
    }
    return(sum);
}
""")
m = np.ones(shape=(10,10))
print 'numpy says', m.sum()

m_p = cast_matrix(m, ffi)

sm = C.sum(m_p, m.shape[0], m.shape[1])
print 'cffi says', sm

just to show the function works:

numpy says 100.0
cffi says 100.0

now if I time this simple function I find that numpy is really slow! Am I using numpy in the correct way? Is there a faster way to calculate the sum in python?

import time
n = 1000000

t0 = time.time()
for i in range(n): C.sum(m_p, m.shape[0], m.shape[1])
t1 = time.time()

print 'cffi', t1-t0

t0 = time.time()
for i in range(n): m.sum()
t1 = time.time()

print 'numpy', t1-t0

times:

cffi 0.818415880203
numpy 5.61657714844

Numpy is slower than C for two reasons: the Python overhead (probably similar to cffi) and generality. Numpy is designed to deal with arrays of arbitrary dimensions, in a bunch of different data types. Your example with cffi was made for a 2D array of floats. The cost was writing several lines of code vs .sum(), 6 characters to save less than 5 microseconds. (But of course, you already knew this). I just want to emphasize that CPU time is cheap, much cheaper than developer time.

Now, if you want to stick to Numpy, and you want to get a better performance, your best option is to use Bottleneck. They provide a few functions optimised for 1 and 2D arrays of float and doubles, and they are blazing fast. In your case, 16 times faster, which will put execution time in 0.35, or about twice as fast as cffi.

For other functions that bottleneck does not have, you can use Cython. It helps you write C code with a more pythonic syntax. Or, if you will, convert progressively Python into C until you are happy with the speed.