I have read the question and answer "Cython nogil with ThreadPoolExecutor not giving speedups", and I have a similar problem: my Cython code does not get the expected speedup even though my system has multiple cores. I have 4 physical cores on an Ubuntu 18.04 instance, and the code below runs faster with the number of jobs set to 1 than with it set to 4. Looking at CPU usage in top, I see it go up to 300%. I am looking up a data structure in a C++ class that is never modified, i.e. I am only doing read-only queries on the C++ data structure via Cython. There are no mutex locks whatsoever on the C++ side.
This is my first experience with the GIL, and I am wondering whether I have used it incorrectly. The timing output is also a bit confusing, as I do not think it correctly reflects the time actually taken by each of the worker threads.
I appear to have missed something crucial, but I cannot figure out what it is, as I have used pretty much the same template for releasing the GIL as in the linked SO answer.
import time
import psutil
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from functools import partial

cdef extern from "Rectangle.h" namespace "shapes":
    cdef cppclass Rectangle:
        Rectangle(int, int, int, int)
        int x0, y0, x1, y1
        int getArea() nogil

cdef class PyRectangle:
    cdef Rectangle *rect

    def __cinit__(self, int x0, int y0, int x1, int y1):
        self.rect = new Rectangle(x0, y0, x1, y1)

    def __dealloc__(self):
        del self.rect

    def testThread(self):
        # minLat, maxLat, minLon, maxLon and maxDistance are not shown in this snippet
        latGrid = np.arange(minLat, maxLat, 0.05)
        lonGrid = np.arange(minLon, maxLon, 0.05)
        gridLon, gridLat = np.meshgrid(latGrid, lonGrid)
        grid_points = np.c_[gridLon.ravel(), gridLat.ravel()]
        # One chunk of grid points per physical core
        n_jobs = psutil.cpu_count(logical=False)
        chunk = np.array_split(grid_points, n_jobs, axis=0)
        x = ThreadPoolExecutor(max_workers=n_jobs)
        t0 = time.time()
        func = partial(self.performCalc, maxDistance)
        results = x.map(func, chunk)
        results = np.vstack(list(results))
        t1 = time.time()
        print(t1 - t0)

    def performCalc(self, maxDistance, chunk):
        cdef int area
        cdef Py_ssize_t i
        cdef double[:, :] gPoints
        gPoints = memoryview(chunk)
        for i in range(0, len(gPoints)):
            # The GIL is released and reacquired once per point
            with nogil:
                area = self.getArea2(gPoints[i])
        return area

    cdef int getArea2(self, double[:] p) nogil:
        cdef int area
        area = self.rect.getArea()
        return area
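On the timing concern mentioned above: a single `time.time()` pair around `x.map` only gives the total wall time, not what each worker did. Below is a minimal sketch of how per-worker timings could be collected instead. It is plain Python, not the Cython code above: `placeholder_work` is a hypothetical stand-in for `performCalc`, `timed_call` and `run_timed` are names made up for this illustration, and `time.thread_time()` assumes Python 3.7 or later.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(func, chunk):
    """Run func(chunk) and record wall-clock and CPU time for this thread."""
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()  # CPU time of the current thread (Python 3.7+)
    result = func(chunk)
    cpu = time.thread_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, threading.get_ident(), wall, cpu


def run_timed(func, chunks, n_jobs):
    """Map func over chunks on n_jobs threads; return results and timings."""
    total_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        outcomes = list(pool.map(lambda c: timed_call(func, c), chunks))
    total = time.perf_counter() - total_start
    results = [o[0] for o in outcomes]
    timings = [o[1:] for o in outcomes]  # (thread id, wall seconds, CPU seconds)
    return results, total, timings


def placeholder_work(chunk):
    # Hypothetical stand-in for performCalc: sums lat + lon for each point.
    return [lat + lon for lat, lon in chunk]


if __name__ == "__main__":
    points = [(i * 0.05, i * 0.1) for i in range(200_000)]
    for n_jobs in (1, 4):
        size = -(-len(points) // n_jobs)  # ceiling division
        chunks = [points[k:k + size] for k in range(0, len(points), size)]
        _, total, timings = run_timed(placeholder_work, chunks, n_jobs)
        print(f"n_jobs={n_jobs}: total wall time {total:.4f}s")
        for tid, wall, cpu in timings:
            print(f"  thread {tid}: wall {wall:.4f}s, cpu {cpu:.4f}s")
```

As far as I understand, the per-worker wall time includes any time a thread spends waiting to reacquire the GIL, while the CPU time does not, so a large gap between the two for `n_jobs=4` would point at contention rather than at the work itself.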