How do you optimize this code? At the moment it is running too slowly for the amount of data that goes through this loop. This code runs 1-nearest neighbor: it predicts the label of training_element based on p_data_set.
# [x] , [[x1],[x2],[x3]], [l1, l2, l3]
import numpy as np
from scipy.spatial import distance

def prediction(training_element, p_data_set, p_label_set):
    # Distance from training_element to every point, one append at a time
    temp = np.array([], dtype=float)
    for p in p_data_set:
        temp = np.append(temp, distance.euclidean(training_element, p))
    # The index of the closest point picks the predicted label
    minIndex = np.argmin(temp)
    return p_label_set[minIndex]
Use a k-D tree for fast nearest-neighbour lookups, e.g. scipy.spatial.cKDTree.
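A minimal sketch of how that could look with the question's arguments (ideally you would build the tree once and reuse it across calls):

import numpy as np
from scipy.spatial import cKDTree

def prediction_kdtree(training_element, p_data_set, p_label_set):
    # Build the tree over all candidate points (do this once, outside, if possible)
    tree = cKDTree(np.asarray(p_data_set))
    # query returns (distance, index) of the nearest neighbour for k=1
    _, min_index = tree.query(training_element, k=1)
    return p_label_set[min_index]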
You could use distance.cdist to directly get the distances temp and then use .argmin() to get the min index, like so -
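A sketch of that idea, assuming training_element is 1-D as in the question's comment:

import numpy as np
from scipy.spatial import distance

def prediction_cdist(training_element, p_data_set, p_label_set):
    # cdist computes all pairwise distances in one vectorized call;
    # it expects 2-D inputs, so promote the 1-D training_element first
    temp = distance.cdist(np.atleast_2d(training_element),
                          np.asarray(p_data_set)).ravel()
    # Same argmin step as before, just without the Python-level loop
    return p_label_set[temp.argmin()]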
Here's an alternative approach using np.einsum -
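One way that could look (a sketch; the square root is skipped because it does not change which index is smallest):

import numpy as np

def prediction_einsum(training_element, p_data_set, p_label_set):
    # Broadcast the subtraction, then let einsum do the row-wise squared sums
    d = np.asarray(p_data_set) - training_element
    sq_dists = np.einsum('ij,ij->i', d, d)
    return p_label_set[sq_dists.argmin()]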
Runtime test

Well, I was thinking cKDTree would easily beat cdist, but I guess training_element being a 1D array isn't too heavy for cdist, and I am seeing it beat out cKDTree instead by a good 10x+ margin! Here are the timing results -
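A harness along these lines can be used for such a comparison (the data shapes and sizes below are illustrative assumptions, not the original setup):

import timeit
import numpy as np

rng = np.random.default_rng(0)
p_data_set = rng.random((10000, 3))       # assumed size, for illustration only
p_label_set = rng.integers(0, 5, size=10000)
training_element = rng.random(3)

for fn in (prediction, prediction_cdist, prediction_einsum, prediction_kdtree):
    t = timeit.timeit(lambda: fn(training_element, p_data_set, p_label_set),
                      number=100)
    print(fn.__name__, t)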
Python can be quite a fast programming language if used properly. This is my suggestion (faster_prediction):
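A minimal sketch of what faster_prediction could look like, assuming the speed-up comes from plain NumPy broadcasting:

import numpy as np

def faster_prediction(training_element, p_data_set, p_label_set):
    # One vectorized pass: broadcast the subtraction, square, and sum per row
    sq_dists = np.sum((np.asarray(p_data_set) - training_element) ** 2, axis=1)
    # argmin over squared distances equals argmin over true distances
    return p_label_set[np.argmin(sq_dists)]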
I get the following result on a pretty old laptop:

which makes me think I must have made some stupid mistake :)
For very large data sets, where memory might be an issue, I suggest using Cython or implementing the function in C++ and wrapping it in Python.