Consider the following class:
class SquareErrorDistance(object):
def __init__(self, dataSample):
variance = var(list(dataSample))
if variance == 0:
self._norm = 1.0
else:
self._norm = 1.0 / (2 * variance)
def __call__(self, u, v): # u and v are floats
return (u - v) ** 2 * self._norm
I use it to calculate the distance between two elements of a vector. I basically create one instance of that class for every dimension of the vector that uses this distance measure (there are dimensions that use other distance measures). Profiling reveals that the __call__
function of this class accounts for 90% of the running-time of my knn-implementation (who would have thought). I do not think there is any pure-Python way to speed this up, but maybe if I implement it in C?
If I run a simple C program that just calculates distances for random values using the formula above, it is orders of magnitude faster than Python. So I tried using ctypes and call a C function that does the computation, but apparently the conversion of the parameters and return-values is far to expensive, because the resulting code is much slower.
I could of course implement the entire knn in C and just call that, but the problem is that, like I described, I use different distance functions for some dimension of the vectors, and translating these to C would be too much work.
So what are my alternatives? Will writing the C-function using the Python C-API get rid of the overhead? Are there any other ways to speed this calculation up?