I have been trying to get over my fear of Cython (fear because I know literally nothing about C or C++).
I have a function that takes two arguments: a set (we'll call it testSet) and a list of sets (we'll call that targetSets). The function iterates through targetSets, computes the length of each set's intersection with testSet, and adds that value to a list, which is then returned.
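For reference, a minimal pure-Python sketch of the function described above. It assumes the same thresholding as the Cython version further down (overlaps of 0 or 1 are recorded as 0), and the name computeOverlapPy is only for illustration:

```python
def computeOverlapPy(testSet, targetSets):
    """Return the size of testSet's overlap with each set in targetSets."""
    obsOverlaps = []
    for target in targetSets:
        n = len(testSet & target)
        # Assumption: overlaps of 0 or 1 are recorded as 0,
        # mirroring the Cython version in overlapFunc.pyx.
        obsOverlaps.append(n if n > 1 else 0)
    return obsOverlaps


# Small hand-checkable example: overlaps are {1, 2}, {3} and the empty set.
example = computeOverlapPy({1, 2, 3}, [{1, 2}, {3, 4}, {5}])
```

Here example should come out as [2, 0, 0], since only the first target shares more than one element with the test set.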
Now, this isn't slow by itself, but the problem is that I need to run simulations of testSet (a large number of them, ~10,000), and targetSets is about 10,000 sets long.
For a small number of test simulations, the pure Python implementation was taking ~50 seconds.
I tried writing a Cython function; it works, and it now runs in ~16 seconds.
If anyone can think of anything else I could do to the Cython function, that would be great (Python 2.7, by the way).
Here is my Cython implementation, in overlapFunc.pyx:
    def computeOverlap(set testSet, list targetSets):
        cdef list obsOverlaps = []
        cdef int i, N
        cdef set overlap
        N = len(targetSets)
        for i in range(N):
            overlap = testSet & targetSets[i]
            if len(overlap) <= 1:
                obsOverlaps.append(0)
            else:
                obsOverlaps.append(len(overlap))
        return obsOverlaps
And here is setup.py:
    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Distutils import build_ext

    ext_modules = [Extension("overlapFunc",
                             ["overlapFunc.pyx"])]

    setup(
        name = 'computeOverlap function',
        cmdclass = {'build_ext': build_ext},
        ext_modules = ext_modules
    )
And here is test.py, which builds some random sets for testing and times the function:
    import numpy as np
    from overlapFunc import computeOverlap
    import time

    def simRandomSet(n):
        for i in range(n):
            simSet = set(np.random.randint(low=1, high=100, size=50))
            yield simSet

    if __name__ == '__main__':
        np.random.seed(23032014)
        targetSet = [set(np.random.randint(low=1, high=100, size=50)) for i in range(10000)]
        simulatedTestSets = simRandomSet(200)
        start = time.time()
        for i in simulatedTestSets:
            obsOverlaps = computeOverlap(i, targetSet)
        print time.time()-start
I tried changing the def at the start of the computeOverlap function to cdef, as in:
    cdef list computeOverlap(set testSet, list targetSets):
but I get the following warning message when I run the setup.py script:
    '__pyx_f_11overlapFunc_computeOverlap' defined but not used [-Wunused-function]
and then when I run something that tries to use the function, I get an ImportError:
    from overlapFunc import computeOverlap
    ImportError: cannot import name computeOverlap
Thanks in advance for your help,
Cheers,
Davy