I have some independent computations I would like to do in parallel using Cython.
Right now I'm using this approach:
import numpy as np
cimport numpy as cnp
from cython.parallel import prange
[...]
cdef cnp.ndarray[cnp.float64_t, ndim=2] temporary_variable = \
np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)
cdef cnp.ndarray[cnp.float64_t, ndim=2] result = \
np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)
for i in prange(INPUT_SIZE, nogil=True):
for j in range(RESULT_SIZE):
[...]
temporary_variable[i, j] = some_very_heavy_mathematics(my_input_array)
result[i, j] = some_more_maths(temporary_variable[i, j])
This methodology works but my problem comes from the fact that I in fact need several temporary_variable
s. This results in huge memory usage when INPUT_SIZE
grows. But I believe what is really needed is a temporary variable in each thread instead.
Am I facing a limitation of Cython's prange and do I need to learn proper C or am I doing/understanding something terribly wrong?
EDIT: The functions I was looking for were openmp.omp_get_max_threads()
and openmp.omp_get_thread_num()
to create a reasonably sized temporary array. I had to cimport openmp
first.
This is something that Cython tries to detect, and actually gets right most of the time. If we take a more complete example code:
import numpy as np
from cython.parallel import prange
cdef double f1(double[:,:] x, int i, int j) nogil:
return 2*x[i,j]
cdef double f2(double y) nogil:
return y+10
def example_function(double[:,:] arr_in):
cdef double[:,:] result = np.zeros(arr_in.shape)
cdef double temporary_variable
cdef int i,j
for i in prange(arr_in.shape[0], nogil=True):
for j in range(arr_in.shape[1]):
temporary_variable = f1(arr_in,i,j)
result[i,j] = f2(temporary_variable)
return result
(this is basically the same as yours, but compilable). This compiles to the C code:
#pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) lastprivate(__pyx_v_j) lastprivate(__pyx_v_temporary_variable)
#endif /* _OPENMP */
for (__pyx_t_8 = 0; __pyx_t_8 < __pyx_t_9; __pyx_t_8++){
You can see that temporary_variable
is set to be thread-local. If Cython does not detect this correctly (I find it's often too keen to make variables a reduction) then my suggestion is to encapsulate (some of) the contents of the loop in a function:
cdef double loop_contents(double[:,:] arr_in, int i, int j) nogil:
cdef double temporary_variable
temporary_variable = f1(arr_in,i,j)
return f2(temporary_variable)
Doing so forces temporary_variable
to be local to the function (and hence to the thread)
With respect to creating a thread-local array: I'm not 100% clear exactly what you want to do but I'll try to take a guess...
- I don't believe it's possible to create a thread-local memoryview.
- You could create a thread-local C array with
malloc
and free
but unless you have a good understanding of C then I would not recommend it.
The easiest way is to allocate a 2D array where you have one column for each thread. The array is shared, but since each thread only touches its own column that doesn't matter. A simple example:
cdef double[:] f1(double[:,:] x, int i) nogil:
return x[i,:]
def example_function(double[:,:] arr_in):
cdef double[:,:] temporary_variable = np.zeros((arr_in.shape[1],openmp.omp_get_max_threads()))
cdef int i
for i in prange(arr_in.shape[0],nogil=True):
temporary_variable[:,openmp.omp_get_thread_num()] = f1(arr_in,i)