I have read this question and answer -Cython nogil with ThreadPoolExecutor not giving speedups and I have a similar problem with my Cython code not getting the speedup that is expected in spite of my system having multiple cores. I have 4 physical cores on a Ubuntu 18.04 instance and if I make the number of jobs to be 1 in the code below it runs faster than when I make it 4. Looking at the CPU usage using top I see the CPU usage go upto 300 %. I am doing the lookup of a data structure in a C++ class that does not get modified i.e. I am only doing read-only queries on the C++ data structure via Cython. There are no mutex locks whatsoever on the C++ side.
This is my first experience with the GIL and I am wondering whether I have incorrectly used it. Also the output of the time is a bit confusing as I do not think it correctly profiles the actual time taken by the each of the worker threads.
I appear to have missed something crucial but I cannot figure out what it is as I have pretty much used the same template for the usage of the GIL as seen in the linked SO answer.
import psutil
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from functools import partial
cdef extern from "Rectangle.h" namespace "shapes":
cdef cppclass Rectangle:
Rectangle(int, int, int, int)
int x0, y0, x1, y1
int getArea() nogil
cdef class PyRectangle:
cdef Rectangle *rect
def __cinit__(self, int x0, int y0, int x1, int y1):
self.rect = new Rectangle(x0, y0, x1, y1)
def __dealloc__(self):
del self.rect
def testThread(self):
latGrid = np.arange(minLat,maxLat,0.05)
lonGrid = np.arange(minLon,maxLon,0.05)
gridLon,gridLat = np.meshgrid(latGrid,lonGrid)
grid_points = np.c_[gridLon.ravel(),gridLat.ravel()]
n_jobs = psutil.cpu_count(logical=False)
chunk = np.array_split(grid_points,n_jobs,axis=0)
x = ThreadPoolExecutor(max_workers=n_jobs)
t0 = time.time()
func = partial(self.performCalc,maxDistance)
results = x.map(func,chunk)
results = np.vstack(list(results))
t1 = time.time()
print(t1-t0)
def performCalc(self,maxDistance,chunk):
cdef int area
cdef double[:,:] gPoints
gPoints = memoryview(chunk)
for i in range(0,len(gPoints)):
with nogil:
area = self.getArea2(gPoints[i])
return area
cdef int getArea2(self,double[:] p) nogil :
cdef int area
area = self.rect.getArea()
return area
My suggestion (in the comments) was to ensure that the entire
performCalc
loop wasnogil
. To do this a few changes were needed:The most important of which is swapping
len(gPoints)
forgPoints.shape[0]
which replaces a call to a Python function with an array lookup (also I personally don't thinklen
makes sense for a 2D array).Essentially there's a cost to acquiring and releasing the GIL. You want to make sure that the work done without the GIL is worth the time spent handling it. Simply calculating an area of a rectangle is pretty trivial (two subtractions and a multiplication) and so doesn't really justify the time spent coordinating the GIL between threads - remember that once every loop each thread must (briefly) hold the GIL, during which time no other thread can hold it. However with the whole loop as
nogil
the time spent on administering it becomes tiny.