I have been trying to get over my fear of Cython (fear because I literally know NOTHING about c, or c++)
I have a function which takes 2 arguments, a set (we'll call it testSet
), and a list of sets (we'll call that targetSets
). The function then iterates through targetSets
, and computes the length of the intersection with testSet
, adding that value to a list, which is then returned.
Now, this isn't by itself that slow, but the problem is I need to do simulations of the testSet (and a large number at that, ~ 10,000), and the targetSet is about 10,000 sets long.
So for a small number of simulations to test, the pure python implementation was taking ~50 secs.
I tried making a cython function, and it worked and it's now running at ~16 secs.
If there is anything else that I could do to the cython function that anyone could think of that would be great (python 2.7 btw)
Here is my Cython implementation in overlapFunc.pyx
def computeOverlap(set testSet, list targetSets):
cdef list obsOverlaps = []
cdef int i, N
cdef set overlap
N = len(targetSets)
for i in range(N):
overlap = testSet & targetSets[i]
if len(overlap) <= 1:
obsOverlaps.append(0)
else:
obsOverlaps.append(len(overlap))
return obsOverlaps
and the setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
ext_modules = [Extension("overlapFunc",
["overlapFunc.pyx"])]
setup(
name = 'computeOverlap function',
cmdclass = {'build_ext': build_ext},
ext_modules = ext_modules
)
and some code to build some random sets for testing and to time the function. test.py
import numpy as np
from overlapFunc import computeOverlap
import time
def simRandomSet(n):
for i in range(n):
simSet= set(np.random.randint(low=1, high=100, size=50))
yield simSet
if __name__ == '__main__':
np.random.seed(23032014)
targetSet = [set(np.random.randint(low=1, high=100, size=50)) for i in range(10000)]
simulatedTestSets = simRandomSet(200)
start = time.time()
for i in simulatedTestSets:
obsOverlaps = computeOverlap(i, targetSet)
print time.time()-start
I tried changing the def at the start of the computerOverlap function, as in:
cdef list computeOverlap(set testSet, list targetSets):
but I get the following warning message when I run the setup.py
script:
'__pyx_f_11overlapFunc_computeOverlap' defined but not used [-Wunused-function]
and then when I run something that tries to use the function I get an import Error:
from overlapFunc import computeOverlap
ImportError: cannot import name computeOverlap
Thanks in advance for your help,
Cheers,
Davy
In the following line, the extension module name and the filename does not match actual filename.
Replace it with: