How do you optimize this code for nn prediction?

2020-04-23 05:30发布

How do you optimize this code? At the moment it is running to slow for the amount of data that goes through this loop. This code runs 1-nearest neighbor. It will predict the label of the training_element based off the p_data_set

#               [x] ,           [[x1],[x2],[x3]],    [l1, l2, l3]
def prediction(training_element, p_data_set, p_label_set):
    temp = np.array([], dtype=float)
    for p in p_data_set:
        temp = np.append(temp, distance.euclidean(training_element, p))

    minIndex = np.argmin(temp)
    return p_label_set[minIndex]

3条回答
SAY GOODBYE
2楼-- · 2020-04-23 05:43

Use a k-D tree for fast nearest-neighbour lookups, e.g. scipy.spatial.cKDTree:

from scipy.spatial import cKDTree

# I assume that p_data_set is (nsamples, ndims)
tree = cKDTree(p_data_set)

# training_elements is also assumed to be (nsamples, ndims)
dist, idx = tree.query(training_elements, k=1)

predicted_labels = p_label_set[idx]
查看更多
ら.Afraid
3楼-- · 2020-04-23 05:46

You could use distance.cdist to directly get the distances temp and then use .argmin() to get min-index, like so -

minIndex = distance.cdist(training_element[None],p_data_set).argmin()

Here's an alternative approach using np.einsum -

subs = p_data_set - training_element
minIndex =  np.einsum('ij,ij->i',subs,subs).argmin()

Runtime test

Well I was thinking cKDTree would easily beat cdist, but I guess training_element being a 1D array isn't too heavy for cdist and I am seeing it to beat out cKDTree instead by a good 10x+ margin!

Here's the timing results -

In [422]: # Setup arrays
     ...: p_data_set = np.random.randint(0,9,(40000,100))
     ...: training_element = np.random.randint(0,9,(100,))
     ...: 

In [423]: def tree_based(p_data_set,training_element): #@ali_m's soln
     ...:     tree = cKDTree(p_data_set)
     ...:     dist, idx = tree.query(training_element, k=1)
     ...:     return idx
     ...: 
     ...: def einsum_based(p_data_set,training_element):    
     ...:     subs = p_data_set - training_element
     ...:     return np.einsum('ij,ij->i',subs,subs).argmin()
     ...: 

In [424]: %timeit tree_based(p_data_set,training_element)
1 loops, best of 3: 210 ms per loop

In [425]: %timeit einsum_based(p_data_set,training_element)
100 loops, best of 3: 17.3 ms per loop

In [426]: %timeit distance.cdist(training_element[None],p_data_set).argmin()
100 loops, best of 3: 14.8 ms per loop
查看更多
Viruses.
4楼-- · 2020-04-23 05:57

Python can be quite fast programming language if used properly. This is my suggestion (faster_prediction):

import numpy as np
import time

def euclidean(a,b):
    return np.linalg.norm(a-b)

def prediction(training_element, p_data_set, p_label_set):
    temp = np.array([], dtype=float)
    for p in p_data_set:
        temp = np.append(temp, euclidean(training_element, p))

    minIndex = np.argmin(temp)
    return p_label_set[minIndex]

def faster_prediction(training_element, p_data_set, p_label_set):    
    temp = np.tile(training_element, (p_data_set.shape[0],1))
    temp = np.sqrt(np.sum( (temp - p_data_set)**2 , 1))    

    minIndex = np.argmin(temp)
    return p_label_set[minIndex]   


training_element = [1,2,3]
p_data_set = np.random.rand(100000, 3)*10
p_label_set = np.r_[0:p_data_set.shape[0]]


t1 = time.time()
result_1 = prediction(training_element, p_data_set, p_label_set)
t2 = time.time()

t3 = time.time()
result_2 = faster_prediction(training_element, p_data_set, p_label_set)
t4 = time.time()


print "Execution time 1:", t2-t1, "value: ", result_1
print "Execution time 2:", t4-t3, "value: ", result_2
print "Speed up: ", (t4-t3) / (t2-t1)

I get the following result on pretty old laptop:

Execution time 1: 21.6033108234 value:  9819
Execution time 2: 0.0176379680634 value:  9819
Speed up:  1224.81857013

which makes me think I must have done some stupid mistake :)

In case of very huge data, where memory might be an issue, I suggest using Cython or implementing function in C++ and wrapping it in python.

查看更多
登录 后发表回答