Usage of threadpoolexecutor in conjunction with cy

Posted 2019-08-17 20:30

Question:

I have read this question and answer, "Cython nogil with ThreadPoolExecutor not giving speedups", and I have a similar problem: my Cython code does not get the expected speedup even though my system has multiple cores. I have 4 physical cores on an Ubuntu 18.04 instance, and if I set the number of jobs to 1 in the code below it runs faster than when I set it to 4. Looking at the CPU usage with top, I see it go up to 300%. I am looking up a data structure in a C++ class that does not get modified, i.e. I am only doing read-only queries on the C++ data structure via Cython. There are no mutex locks whatsoever on the C++ side.

This is my first experience with the GIL and I am wondering whether I have used it incorrectly. Also, the timing output is a bit confusing, as I do not think it correctly profiles the actual time taken by each of the worker threads.

I appear to have missed something crucial but I cannot figure out what it is as I have pretty much used the same template for the usage of the GIL as seen in the linked SO answer.

import time

import psutil
import numpy as np

from concurrent.futures import ThreadPoolExecutor
from functools import partial



cdef extern from "Rectangle.h" namespace "shapes":
    cdef cppclass Rectangle:
        Rectangle(int, int, int, int)
        int x0, y0, x1, y1
        int getArea() nogil


cdef class PyRectangle:
    cdef Rectangle *rect

    def __cinit__(self, int x0, int y0, int x1, int y1):
        self.rect = new Rectangle(x0, y0, x1, y1)

    def __dealloc__(self):
        del self.rect

    def testThread(self):
        # Note: minLat, maxLat, minLon, maxLon and maxDistance are not
        # defined anywhere in the snippet as posted.
        latGrid = np.arange(minLat,maxLat,0.05)
        lonGrid = np.arange(minLon,maxLon,0.05)

        gridLon,gridLat = np.meshgrid(latGrid,lonGrid)
        grid_points = np.c_[gridLon.ravel(),gridLat.ravel()]

        n_jobs = psutil.cpu_count(logical=False)

        chunk = np.array_split(grid_points,n_jobs,axis=0)
        x = ThreadPoolExecutor(max_workers=n_jobs)

        t0 = time.time()
        func = partial(self.performCalc,maxDistance)
        results = x.map(func,chunk)
        results = np.vstack(list(results))
        t1 = time.time()
        print(t1-t0)

    def performCalc(self,maxDistance,chunk):

        cdef int area
        cdef double[:,:] gPoints
        gPoints = memoryview(chunk)
        for i in range(0,len(gPoints)):
            # The GIL is released and re-acquired on every iteration here.
            with nogil:
                area =  self.getArea2(gPoints[i])
        return area

    cdef int getArea2(self,double[:] p) nogil :
        cdef int area
        area = self.rect.getArea()
        return area

Answer 1:

My suggestion (in the comments) was to ensure that the entire performCalc loop was nogil. To do this a few changes were needed:

cdef Py_ssize_t i # set type of "i" (although Cython can possibly deduce this anyway)
with nogil:
    for i in range(0,gPoints.shape[0]):
        area =  self.getArea2(gPoints[i])

The most important change is swapping len(gPoints) for gPoints.shape[0], which replaces a call to a Python function with an array lookup (I also personally don't think len makes sense for a 2D array).
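As a quick check that the swap does not change the loop bounds, here is a small sketch in plain Python. It uses the standard-library memoryview rather than a Cython typed memoryview (an assumption made so the snippet runs without compiling anything), and the 5 by 2 shape is arbitrary.

```python
# A 5 x 2 buffer of bytes, viewed as a two-dimensional memoryview.
# This is the built-in Python memoryview, not a Cython typed memoryview.
buf = bytes(10)
points = memoryview(buf).cast("B", (5, 2))

rows_via_len = len(points)         # Python-level function call
rows_via_shape = points.shape[0]   # lookup of the first dimension

print(rows_via_len, rows_via_shape, points.ndim)  # prints: 5 5 2
```

Both expressions report the number of rows, so the loop visits the same elements either way; the difference in the Cython code is only in how the bound is obtained.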

Essentially, there's a cost to acquiring and releasing the GIL. You want to make sure that the work done without the GIL is worth the time spent handling it. Simply calculating the area of a rectangle is pretty trivial (two subtractions and a multiplication), so it doesn't really justify the time spent coordinating the GIL between threads. Remember that once per loop iteration each thread must (briefly) hold the GIL, during which time no other thread can hold it. However, with the whole loop as nogil, the time spent on administering it becomes tiny.
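The trade-off can be illustrated in plain Python with an ordinary threading.Lock standing in for the GIL. This is only an analogy under that assumption, not how Cython manages the GIL internally, and CountingLock, areas_lock_per_item, areas_lock_per_chunk and run are hypothetical names made up for this sketch. It counts how often the lock is taken when it is handled once per item versus once per chunk.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CountingLock:
    """A threading.Lock wrapper that counts acquisitions (stand-in for the GIL)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1  # safe: only modified while the lock is held
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def areas_lock_per_item(rects, lock, out):
    """Mimics 'with nogil' inside the loop: the lock is re-taken every iteration."""
    for x0, y0, x1, y1 in rects:
        area = (x1 - x0) * (y1 - y0)  # trivial work done without the lock
        with lock:                    # bookkeeping done with the lock held
            out.append(area)


def areas_lock_per_chunk(rects, lock, out):
    """Mimics the whole loop being nogil: the lock is taken once per chunk."""
    local = [(x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in rects]
    with lock:
        out.extend(local)


def run(worker, chunks):
    lock, out = CountingLock(), []
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # list() waits for all chunks and re-raises any worker exception
        list(pool.map(lambda c: worker(c, lock, out), chunks))
    return lock.acquisitions, sorted(out)


rects = [(0, 0, i, i + 1) for i in range(1, 1001)]
chunks = [rects[i::4] for i in range(4)]  # 4 chunks of 250 rectangles each

per_item_count, per_item_areas = run(areas_lock_per_item, chunks)
per_chunk_count, per_chunk_areas = run(areas_lock_per_chunk, chunks)
print(per_item_count, per_chunk_count)  # prints: 1000 4
```

Both versions compute the same areas, but the per-item version hands the lock around 1000 times while the per-chunk version does so 4 times. When the work per item is as small as a rectangle area, those handoffs can dominate the running time.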