I am trying out cython for first time. And tried to convert a function from using pure numpy to cython
Here are the two functions:
from __future__ import division
import numpy as np
cimport numpy as np
DTYPEf = np.float64
ctypedef np.float64_t DTYPEf_t
DTYPEi = np.int64
ctypedef np.int64_t DTYPEi_t
DTYPEu = np.uint8
ctypedef np.uint8_t DTYPEu_t
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def twodcitera(np.ndarray[DTYPEf_t, ndim=3] data, int res, int indexl, int indexu, float radius1, float radius2, output, float height1, float height2 ):
'''
Function to return correlation for fixed radius using Cython
'''
cdef float sum_mask = 0
cdef int i,j,k
cdef int a, b, c
cdef np.ndarray[DTYPEi_t, ndim=3] x
cdef np.ndarray[DTYPEi_t, ndim=3] y
cdef np.ndarray[DTYPEi_t, ndim=3] z
cdef np.ndarray[DTYPEu_t, ndim=3, cast=True] R
a,b,c = res//2,res//2,res//2
x,y,z = np.ogrid[-a:a,-b:b,-c:c]
for i in xrange(indexl,indexu):
for j in xrange(1):
for k in xrange(1):
R = np.roll(np.roll(np.roll(np.logical_and(np.logical_or(np.logical_and(z>height1,z<=height2), np.logical_and(z<-height1,z>=-height2)), np.logical_and(x**2 + y**2<= radius2**2, x**2 + y**2 > radius1**2)), (i-a), axis =0), (j-a), axis =1), (k-a), axis =2)
sum_mask += (data[i][j][k] * np.average(data[R]))
output.put(sum_mask)
And for numpy implementation:
def no_twodcitera(data, res, indexl, indexu, radius1, radius2, output, height1, height2 ):
'''
Function to return correlation for fixed radius
'''
a,b,c = res/2,res/2,res/2
x,y,z = np.ogrid[-a:a,-b:b,-c:c]
sum_mask = 0
for i in xrange(indexl,indexu):
for j in xrange(1):
for k in xrange(1):
R = np.roll(np.roll(np.roll(np.logical_and(np.logical_or(np.logical_and(z>height1,z<=height2), np.logical_and(z<-height1,z>=-height2)), np.logical_and(x**2 + y**2<= radius2**2, x**2 + y**2 > radius1**2)), (i-a), axis =0), (j-a), axis =1), (k-a), axis =2)
sum_mask += (data[i][j][k] * np.average(data[R]))
output.put(sum_mask)
Both functions actually give me same time for completion.
%timeit -n200 -r10 twodcitera(dd, tes_res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])
200 loops, best of 10: 1.57 ms per loop
%timeit -n200 -r10 no_twodcitera(dd, tes_res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])
200 loops, best of 10: 1.57 ms per loop
I was wondering what I was doing wrong or I haven't understood correctly when trying to implement cython. Inputs are:
dd = np.random.randn(64,64,64)
res = 64
r = np.arange(0,21,2)
in1 = 0
in2 = 1
l = 5
k = 7
output = mp.Queue()
Thank you if you could point out my misunderstanding here.
Without knowing your input and output the following compiled for me following the cython guide If you explain how to create test input i could maybe provide more help.
EDIT: My first thought was that there is maybe something up with the cython compilation. But i could not find anything realy helpful. Therfore this answer is not really helpful for improving the speed problem. Anyway i leave it here for those interested in the testing and understanding.
Put the code into test.pyx
Create a make file setup.py and put
Go to the shell and compile it:
Go to ipython and try to import:
worked for me to run.
A speed test showed:
So the results are the same and there is not much difference. I furthermore conducted a cProfile study to see if there is something showing up in the runtime of the call stack. One has to confess, that cProfile gets hard to interpret when in comes to ms seconds speed! But lets give it a try.
Sadly there is nothing popping up. I would conclude that the reason might be that numpy is already well implemented and most of the time is not lost in the nested loops. Furthermore cPython mostly benefits from static typing. Due to we are using numpy here this might not be a big benefit.