How do you optimize this code for nn prediction?

How do you optimize this code? At the moment it is running to slow for the amount of data that goes through this loop. This code runs 1-nearest neighbor. It will predict the label of the training_element based off the p_data_set

#               [x] ,           [[x1],[x2],[x3]],    [l1, l2, l3]
def prediction(training_element, p_data_set, p_label_set):
    temp = np.array([], dtype=float)
    for p in p_data_set:
        temp = np.append(temp, distance.euclidean(training_element, p))

    minIndex = np.argmin(temp)
    return p_label_set[minIndex]

标签： python performance numpy scipy nearest-neighbor

3条回答

SAY GOODBYE

2楼-- · 2020-04-23 05:43

Use a k-D tree for fast nearest-neighbour lookups, e.g. scipy.spatial.cKDTree:

from scipy.spatial import cKDTree

# I assume that p_data_set is (nsamples, ndims)
tree = cKDTree(p_data_set)

# training_elements is also assumed to be (nsamples, ndims)
dist, idx = tree.query(training_elements, k=1)

predicted_labels = p_label_set[idx]

0人赞添加讨论(0) 举报

ら.Afraid

3楼-- · 2020-04-23 05:46

You could use distance.cdist to directly get the distances temp and then use .argmin() to get min-index, like so -

minIndex = distance.cdist(training_element[None],p_data_set).argmin()

Here's an alternative approach using np.einsum -

subs = p_data_set - training_element
minIndex =  np.einsum('ij,ij->i',subs,subs).argmin()

Runtime test

Well I was thinking cKDTree would easily beat cdist, but I guess training_element being a 1D array isn't too heavy for cdist and I am seeing it to beat out cKDTree instead by a good 10x+ margin!

Here's the timing results -

In [422]: # Setup arrays
     ...: p_data_set = np.random.randint(0,9,(40000,100))
     ...: training_element = np.random.randint(0,9,(100,))
     ...: 

In [423]: def tree_based(p_data_set,training_element): #@ali_m's soln
     ...:     tree = cKDTree(p_data_set)
     ...:     dist, idx = tree.query(training_element, k=1)
     ...:     return idx
     ...: 
     ...: def einsum_based(p_data_set,training_element):    
     ...:     subs = p_data_set - training_element
     ...:     return np.einsum('ij,ij->i',subs,subs).argmin()
     ...: 

In [424]: %timeit tree_based(p_data_set,training_element)
1 loops, best of 3: 210 ms per loop

In [425]: %timeit einsum_based(p_data_set,training_element)
100 loops, best of 3: 17.3 ms per loop

In [426]: %timeit distance.cdist(training_element[None],p_data_set).argmin()
100 loops, best of 3: 14.8 ms per loop

0人赞添加讨论(0) 举报

Viruses.

4楼-- · 2020-04-23 05:57

Python can be quite fast programming language if used properly. This is my suggestion (faster_prediction):

import numpy as np
import time

def euclidean(a,b):
    return np.linalg.norm(a-b)

def prediction(training_element, p_data_set, p_label_set):
    temp = np.array([], dtype=float)
    for p in p_data_set:
        temp = np.append(temp, euclidean(training_element, p))

    minIndex = np.argmin(temp)
    return p_label_set[minIndex]

def faster_prediction(training_element, p_data_set, p_label_set):    
    temp = np.tile(training_element, (p_data_set.shape[0],1))
    temp = np.sqrt(np.sum( (temp - p_data_set)**2 , 1))    

    minIndex = np.argmin(temp)
    return p_label_set[minIndex]   


training_element = [1,2,3]
p_data_set = np.random.rand(100000, 3)*10
p_label_set = np.r_[0:p_data_set.shape[0]]


t1 = time.time()
result_1 = prediction(training_element, p_data_set, p_label_set)
t2 = time.time()

t3 = time.time()
result_2 = faster_prediction(training_element, p_data_set, p_label_set)
t4 = time.time()


print "Execution time 1:", t2-t1, "value: ", result_1
print "Execution time 2:", t4-t3, "value: ", result_2
print "Speed up: ", (t4-t3) / (t2-t1)

I get the following result on pretty old laptop:

Execution time 1: 21.6033108234 value:  9819
Execution time 2: 0.0176379680634 value:  9819
Speed up:  1224.81857013

which makes me think I must have done some stupid mistake :)

In case of very huge data, where memory might be an issue, I suggest using Cython or implementing function in C++ and wrapping it in python.

0人赞添加讨论(0) 举报

How do you optimize this code for nn prediction?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间