Improving Cython Lapack performance with internal array definitions

Published 2019-06-09 17:56

Question:

I'm trying to speed up some matrix inversion in a model I'm building, specifically by implementing some linear algebra routines in Cython. I have the code working, but am trying to optimize. Currently my Cython looks like this:

import cython
import numpy as np
cimport numpy as cnp
cimport scipy.linalg.cython_lapack as cython_lapack

ctypedef cnp.float64_t REAL_t

def inverse(mat, identity, pivots, k):
    cdef int K = k
    cdef int info = 0  # dgesv writes its status here, so this must be a local, not a shared global
    cdef REAL_t* mat_pointer = <REAL_t*>cnp.PyArray_DATA(mat)
    cdef REAL_t* iden_pointer = <REAL_t*>cnp.PyArray_DATA(identity)
    cdef int* piv_pointer = <int*>cnp.PyArray_DATA(pivots)
    cython_lapack.dgesv(&K, &K, mat_pointer, &K, piv_pointer, iden_pointer, &K, &info)

Once I compile (as linalg.pyx), I can run this just fine, but because dgesv modifies some of the input variables, I have to use a wrapper function and do some copying:

import linalg
import numpy as np

identity = np.eye(K) # K is the dimensionality of the square matrix
def lapack_inverse(a):
    b = identity.copy()
    linalg.inverse(a,b,np.zeros(K,dtype=np.intc),K)
    return b
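For checking results without recompiling the extension, the same LAPACK routine is exposed from Python as `scipy.linalg.lapack.dgesv`. A hedged sketch of an equivalent pure-Python wrapper (the function name `lapack_inverse_py` is mine, not from the original code):

```python
import numpy as np
from scipy.linalg import lapack

def lapack_inverse_py(a):
    # dgesv solves A @ X = B; with B = identity, X is the inverse of A.
    # Pass a copy defensively, since dgesv overwrites its first argument.
    lu, piv, x, info = lapack.dgesv(a.copy(), np.eye(a.shape[0]))
    if info != 0:
        raise np.linalg.LinAlgError("dgesv failed with info=%d" % info)
    return x
```

This does roughly what the Cython wrapper does, just with Python-level call overhead, so it is useful as a correctness reference rather than a performance baseline.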

For small K, we see pretty solid speed improvements (2-3x):

In [29]: K=10
In [30]: x = np.random.random((K,K))
In [31]: identity = np.eye(K)
In [32]: print np.allclose(np.linalg.inv(x),lapack_inverse(x))
True
In [33]: %timeit np.linalg.inv(x)
10000 loops, best of 3: 28 µs per loop
In [34]: %timeit lapack_inverse(x)
100000 loops, best of 3: 10.3 µs per loop

But for larger K, returns clearly diminish (e.g. for K=200):

In [8]: %timeit np.linalg.inv(x)
100 loops, best of 3: 10.1 ms per loop
In [9]: %timeit lapack_inverse(x)
100 loops, best of 3: 8.1 ms per loop

I'm thinking I might be able to squeeze better performance out of this by defining the identity and pivot arrays within C, rather than doing the copying and passing that I'm currently doing. In other words, I'd like to modify the Cython function so that it only requires that I pass the array to be inverted.

My questions are (a): Will this lead to speed improvements (even small ones)? and, assuming it will, (b) how do I go about it defining the pivot and identity arrays in Cython directly?

Answer 1:

There are lots of ways to allocate memory in Cython - essentially you can create any Python object or use the standard C allocation techniques. This previous question lists many of them (with timings given in the answers) and is almost certainly more broadly useful than my answer, though it deals with a slightly different use-case.

In your case you have to allocate the identity matrix as a numpy array, since the answer gets written into it and you want to return it from the function (you could obviously allocate it otherwise then copy, but this seems somewhat pointless). I suspect it's best to call eye rather than to copy it each time.
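Whether calling `eye` or copying a pre-built template wins is easy to measure directly; a quick sketch (absolute numbers will vary by machine and by K):

```python
import timeit
import numpy as np

k = 200
template = np.eye(k)

# time 1000 calls of each approach
t_eye = timeit.timeit(lambda: np.eye(k), number=1000)
t_copy = timeit.timeit(lambda: template.copy(), number=1000)

print("np.eye: %.4f s, copy: %.4f s (1000 calls each)" % (t_eye, t_copy))
```

Either way the cost is small relative to the O(K^3) solve, so this choice mostly matters for small K.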

You can allocate pivot how you like. In the example below I've used the C allocation mechanisms as being the most direct way (and thus hopefully fastest) of allocating large chunks of memory. The downside of this is that you do have to deallocate it with free.

cimport numpy as cnp
import numpy as np
cimport scipy.linalg.cython_lapack as cython_lapack
from libc.stdlib cimport malloc, free

ctypedef cnp.float64_t REAL_t

def inverse(mat):
    # you should probably check that mat.shape[0]==mat.shape[1]
    # and that mat is actually a float64 array
    cdef int k = mat.shape[0]
    cdef int info = 0

    cdef REAL_t* mat_pointer = <REAL_t*>cnp.PyArray_DATA(mat)
    # I suspect you should be doing a check that mat_pointer has been assigned
    cdef REAL_t* iden_pointer

    cdef int* piv_pointer = <int*>malloc(sizeof(int)*k)
    # this is uninitialised (the contents are arbitrary),
    # but that's OK because it's used purely as an output
    try:
        identity = np.eye(k)

        iden_pointer = <REAL_t*>cnp.PyArray_DATA(identity)

        cython_lapack.dgesv(&k, &k, mat_pointer, &k,
                            piv_pointer, iden_pointer, &k, &info)
        # you should check info to ensure it's worked
        return identity
    finally:
        free(piv_pointer) # the "try ... finally" ensures that this is freed

There are a few places where I've noted you should add checks to ensure everything is valid; this will slow things down slightly but is safer. You could probably also add some Cython type declarations to get a bit of extra speed, but I haven't investigated that for this example.

Finally - think about why you're using the inverse. If you're just multiplying another matrix by the inverse, it's better (faster and more accurate) to use dgesv directly with that matrix as the right-hand side. There's rarely a good reason to compute an explicit inverse.
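As a concrete illustration of that last point, solving the system directly (here via `scipy.linalg.solve`, which is backed by the same LAPACK machinery) avoids forming the inverse at all:

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
a = rng.random((200, 200))
b = rng.random(200)

x_via_inverse = np.linalg.inv(a) @ b   # forms the explicit inverse first
x_direct = linalg.solve(a, b)          # one LAPACK solve, no inverse formed
```

Both routes give (numerically) the same solution vector, but the direct solve does one factorization and one triangular solve instead of inverting and then multiplying, which is both faster and better conditioned.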