I am trying out cython for first time. And tried to convert a function from using pure numpy to cython
Here are the two functions:
from __future__ import division
import numpy as np
cimport numpy as np
DTYPEf = np.float64
ctypedef np.float64_t DTYPEf_t
DTYPEi = np.int64
ctypedef np.int64_t DTYPEi_t
DTYPEu = np.uint8
ctypedef np.uint8_t DTYPEu_t
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def twodcitera(np.ndarray[DTYPEf_t, ndim=3] data, int res, int indexl, int indexu, float radius1, float radius2, output, float height1, float height2 ):
'''
Function to return correlation for fixed radius using Cython
'''
cdef float sum_mask = 0
cdef int i,j,k
cdef int a, b, c
cdef np.ndarray[DTYPEi_t, ndim=3] x
cdef np.ndarray[DTYPEi_t, ndim=3] y
cdef np.ndarray[DTYPEi_t, ndim=3] z
cdef np.ndarray[DTYPEu_t, ndim=3, cast=True] R
a,b,c = res//2,res//2,res//2
x,y,z = np.ogrid[-a:a,-b:b,-c:c]
for i in xrange(indexl,indexu):
for j in xrange(1):
for k in xrange(1):
R = np.roll(np.roll(np.roll(np.logical_and(np.logical_or(np.logical_and(z>height1,z<=height2), np.logical_and(z<-height1,z>=-height2)), np.logical_and(x**2 + y**2<= radius2**2, x**2 + y**2 > radius1**2)), (i-a), axis =0), (j-a), axis =1), (k-a), axis =2)
sum_mask += (data[i][j][k] * np.average(data[R]))
output.put(sum_mask)
And for numpy implementation:
def no_twodcitera(data, res, indexl, indexu, radius1, radius2, output, height1, height2 ):
'''
Function to return correlation for fixed radius
'''
a,b,c = res/2,res/2,res/2
x,y,z = np.ogrid[-a:a,-b:b,-c:c]
sum_mask = 0
for i in xrange(indexl,indexu):
for j in xrange(1):
for k in xrange(1):
R = np.roll(np.roll(np.roll(np.logical_and(np.logical_or(np.logical_and(z>height1,z<=height2), np.logical_and(z<-height1,z>=-height2)), np.logical_and(x**2 + y**2<= radius2**2, x**2 + y**2 > radius1**2)), (i-a), axis =0), (j-a), axis =1), (k-a), axis =2)
sum_mask += (data[i][j][k] * np.average(data[R]))
output.put(sum_mask)
Both functions actually give me same time for completion.
%timeit -n200 -r10 twodcitera(dd, tes_res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])
200 loops, best of 10: 1.57 ms per loop
%timeit -n200 -r10 no_twodcitera(dd, tes_res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])
200 loops, best of 10: 1.57 ms per loop
I was wondering what I was doing wrong or I haven't understood correctly when trying to implement cython. Inputs are:
dd = np.random.randn(64,64,64)
res = 64
r = np.arange(0,21,2)
in1 = 0
in2 = 1
l = 5
k = 7
output = mp.Queue()
Thank you if you could point out my misunderstanding here.
Without knowing your input and output the following compiled for me following the cython guide If you explain how to create test input i could maybe provide more help.
EDIT: My first thought was that there is maybe something up with the cython compilation. But i could not find anything realy helpful. Therfore this answer is not really helpful for improving the speed problem. Anyway i leave it here for those interested in the testing and understanding.
Put the code into test.pyx
cimport cython
import numpy as np
cimport numpy as np
DTYPEf = np.float64
ctypedef np.float64_t DTYPEf_t
DTYPEi = np.int64
ctypedef np.int64_t DTYPEi_t
DTYPEu = np.uint8
ctypedef np.uint8_t DTYPEu_t
@cython.boundscheck(False)
@cython.wraparound(False)
def twodcitera(np.ndarray[DTYPEf_t, ndim=3] data, int res, int indexl, int indexu, float radius1, float radius2, output, float height1, float height2 ):
'''
Function to return correlation for fixed radius using Cython
'''
cdef float sum_mask = 0
cdef int i,j,k
cdef int a, b, c
cdef np.ndarray[DTYPEi_t, ndim=3] x
cdef np.ndarray[DTYPEi_t, ndim=3] y
cdef np.ndarray[DTYPEi_t, ndim=3] z
cdef np.ndarray[DTYPEu_t, ndim=3, cast=True] R
a,b,c = res//2,res//2,res//2
x,y,z = np.ogrid[-a:a,-b:b,-c:c]
for i in xrange(indexl,indexu):
for j in xrange(1):
for k in xrange(1):
R = np.roll(np.roll(np.roll(np.logical_and(np.logical_or(np.logical_and(z>height1,z<=height2), np.logical_and(z<-height1,z>=-height2)), np.logical_and(x**2 + y**2<= radius2**2, x**2 + y**2 > radius1**2)), (i-a), axis =0), (j-a), axis =1), (k-a), axis =2)
sum_mask += (data[i][j][k] * np.average(data[R]))
output.put(sum_mask)
Create a make file setup.py and put
from distutils.core import setup
from Cython.Build import cythonize
setup(
name = "testapp",
ext_modules = cythonize('test.pyx'), # accepts a glob pattern
)
Go to the shell and compile it:
$python setup.py build_ext --inplace
Go to ipython and try to import:
from test import *
worked for me to run.
A speed test showed:
In [28]: %timeit -n200 -r10 no_twodcitera(dd, res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])
200 loops, best of 10: 1.29 ms per loop
In [29]: %timeit -n200 -r10 test.twodcitera(dd, res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])
200 loops, best of 10: 1.31 ms per loop
So the results are the same and there is not much difference. I furthermore conducted a cProfile study to see if there is something showing up in the runtime of the call stack. One has to confess, that cProfile gets hard to interpret when in comes to ms seconds speed! But lets give it a try.
In [34]: cProfile.run("""no_twodcitera(dd, res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])""")
82 function calls in 0.004 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 0.004 0.004 <ipython-input-27-663e142d15fb>:1(no_twodcitera)
1 0.000 0.000 0.004 0.004 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 _methods.py:43(_count_reduce_items)
1 0.000 0.000 0.000 0.000 _methods.py:53(_mean)
1 0.000 0.000 0.000 0.000 function_base.py:436(average)
1 0.000 0.000 0.000 0.000 index_tricks.py:151(__getitem__)
3 0.000 0.000 0.002 0.001 numeric.py:1279(roll)
1 0.000 0.000 0.000 0.000 numeric.py:394(asarray)
4 0.000 0.000 0.000 0.000 numeric.py:464(asanyarray)
1 0.000 0.000 0.000 0.000 queues.py:99(put)
1 0.000 0.000 0.000 0.000 threading.py:299(_is_owned)
1 0.000 0.000 0.000 0.000 threading.py:372(notify)
1 0.000 0.000 0.000 0.000 threading.py:63(_note)
1 0.000 0.000 0.000 0.000 {hasattr}
18 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 {issubclass}
5 0.000 0.000 0.000 0.000 {len}
3 0.000 0.000 0.000 0.000 {math.ceil}
1 0.000 0.000 0.000 0.000 {method 'acquire' of '_multiprocessing.SemLock' objects}
2 0.000 0.000 0.000 0.000 {method 'acquire' of 'thread.lock' objects}
1 0.000 0.000 0.000 0.000 {method 'append' of 'collections.deque' objects}
3 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'mean' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 0.000 0.000 {method 'release' of 'thread.lock' objects}
3 0.002 0.001 0.002 0.001 {method 'take' of 'numpy.ndarray' objects}
9 0.000 0.000 0.000 0.000 {numpy.core.multiarray.arange}
5 0.000 0.000 0.000 0.000 {numpy.core.multiarray.array}
3 0.000 0.000 0.000 0.000 {numpy.core.multiarray.concatenate}
4 0.000 0.000 0.000 0.000 {range}
1 0.000 0.000 0.000 0.000 {zip}
In [35]: cProfile.run("""test.twodcitera(dd, res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])""")
82 function calls in 0.003 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.003 0.003 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 _methods.py:43(_count_reduce_items)
1 0.000 0.000 0.000 0.000 _methods.py:53(_mean)
1 0.000 0.000 0.000 0.000 function_base.py:436(average)
1 0.000 0.000 0.000 0.000 index_tricks.py:151(__getitem__)
3 0.000 0.000 0.001 0.000 numeric.py:1279(roll)
1 0.000 0.000 0.000 0.000 numeric.py:394(asarray)
4 0.000 0.000 0.000 0.000 numeric.py:464(asanyarray)
1 0.000 0.000 0.000 0.000 queues.py:99(put)
1 0.000 0.000 0.000 0.000 threading.py:299(_is_owned)
1 0.000 0.000 0.000 0.000 threading.py:372(notify)
1 0.000 0.000 0.000 0.000 threading.py:63(_note)
1 0.000 0.000 0.000 0.000 {hasattr}
18 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 {issubclass}
5 0.000 0.000 0.000 0.000 {len}
3 0.000 0.000 0.000 0.000 {math.ceil}
1 0.000 0.000 0.000 0.000 {method 'acquire' of '_multiprocessing.SemLock' objects}
2 0.000 0.000 0.000 0.000 {method 'acquire' of 'thread.lock' objects}
1 0.000 0.000 0.000 0.000 {method 'append' of 'collections.deque' objects}
3 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'mean' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 0.000 0.000 {method 'release' of 'thread.lock' objects}
3 0.001 0.000 0.001 0.000 {method 'take' of 'numpy.ndarray' objects}
9 0.000 0.000 0.000 0.000 {numpy.core.multiarray.arange}
5 0.000 0.000 0.000 0.000 {numpy.core.multiarray.array}
3 0.000 0.000 0.000 0.000 {numpy.core.multiarray.concatenate}
4 0.000 0.000 0.000 0.000 {range}
1 0.001 0.001 0.003 0.003 {test.twodcitera}
1 0.000 0.000 0.000 0.000 {zip}
Sadly there is nothing popping up. I would conclude that the reason might be that numpy is already well implemented and most of the time is not lost in the nested loops. Furthermore cPython mostly benefits from static typing. Due to we are using numpy here this might not be a big benefit.