
How can I optimize this code? It currently runs slowly because of the amount of data it has to loop through. The code performs a 1-nearest-neighbour prediction: it predicts the label for training_element based on p_data_set. How can this NN prediction be optimized?

import numpy as np 
from scipy.spatial import distance 

# training_element: [x], p_data_set: [[x1], [x2], [x3]], p_label_set: [l1, l2, l3] 
def prediction(training_element, p_data_set, p_label_set): 
    temp = np.array([], dtype=float) 
    for p in p_data_set: 
        temp = np.append(temp, distance.euclidean(training_element, p)) 

    minIndex = np.argmin(temp) 
    return p_label_set[minIndex] 

What are the shapes of the inputs involved? – Divakar


(100L,), (40000, 100L), (40000,) – user

Answers


Use a k-D tree for fast nearest-neighbour lookups, e.g. scipy.spatial.cKDTree:

from scipy.spatial import cKDTree 

# I assume that p_data_set is (nsamples, ndims) 
tree = cKDTree(p_data_set) 

# training_elements is also assumed to be (nsamples, ndims) 
dist, idx = tree.query(training_elements, k=1) 

predicted_labels = p_label_set[idx] 
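
Note that tree.query also accepts a single point: for the question's shapes, where training_element is one (100,) vector, it returns a scalar dist and idx. A minimal sketch under that assumption (the toy arrays below are made up for illustration):

import numpy as np 
from scipy.spatial import cKDTree 

# hypothetical arrays matching the question's shapes: (40000, 100) data, (40000,) labels 
p_data_set = np.random.rand(40000, 100) 
p_label_set = np.random.randint(0, 5, 40000) 
training_element = np.random.rand(100) 

tree = cKDTree(p_data_set) 
dist, idx = tree.query(training_element, k=1)  # scalar distance and index for a single query point 
predicted_label = p_label_set[idx] 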

Someday I have to learn this cKDTree! – Divakar


You can get the distances temp directly with distance.cdist and then use .argmin() to get the index of the minimum, like so -

minIndex = distance.cdist(training_element[None],p_data_set).argmin() 
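
For completeness, here is how that one-liner would slot into the original prediction function; a minimal sketch assuming a (d,) NumPy array for training_element and an (n, d) p_data_set (the function name prediction_cdist is made up):

from scipy.spatial import distance 

def prediction_cdist(training_element, p_data_set, p_label_set): 
    # [None] adds a leading axis, turning (d,) into the (1, d) shape cdist expects 
    minIndex = distance.cdist(training_element[None], p_data_set).argmin() 
    return p_label_set[minIndex] 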

Here's another approach using np.einsum -

subs = p_data_set - training_element 
minIndex = np.einsum('ij,ij->i',subs,subs).argmin() 
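
The 'ij,ij->i' subscripts multiply subs elementwise with itself and sum over the last axis, i.e. they compute the row-wise squared Euclidean distances; the sqrt can be skipped because argmin is unaffected by a monotonic transform. A quick sanity check on made-up data:

import numpy as np 

p_data_set = np.random.rand(1000, 100)  # hypothetical data 
training_element = np.random.rand(100) 

subs = p_data_set - training_element    # (n, d) - (d,) broadcasts row-wise 
sq_dists = np.einsum('ij,ij->i', subs, subs) 
assert np.allclose(sq_dists, np.linalg.norm(subs, axis=1)**2) 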

Runtime test

Well, I thought cKDTree would easily beat cdist, but with training_element being a 1D array, cdist doesn't look too heavy; instead I am seeing it beat cKDTree by a good 10x+ margin!

Here are the timing results -

In [422]: # Setup arrays 
    ...: p_data_set = np.random.randint(0,9,(40000,100)) 
    ...: training_element = np.random.randint(0,9,(100,)) 
    ...: 

In [423]: def tree_based(p_data_set,training_element): #@ali_m's soln 
    ...:  tree = cKDTree(p_data_set) 
    ...:  dist, idx = tree.query(training_element, k=1) 
    ...:  return idx 
    ...: 
    ...: def einsum_based(p_data_set,training_element):  
    ...:  subs = p_data_set - training_element 
    ...:  return np.einsum('ij,ij->i',subs,subs).argmin() 
    ...: 

In [424]: %timeit tree_based(p_data_set,training_element) 
1 loops, best of 3: 210 ms per loop 

In [425]: %timeit einsum_based(p_data_set,training_element) 
100 loops, best of 3: 17.3 ms per loop 

In [426]: %timeit distance.cdist(training_element[None],p_data_set).argmin() 
100 loops, best of 3: 14.8 ms per loop 

Python can be quite a fast programming language if used properly. Here is my suggestion (faster_prediction):

import numpy as np 
import time 

def euclidean(a, b): 
    return np.linalg.norm(a - b) 

def prediction(training_element, p_data_set, p_label_set): 
    temp = np.array([], dtype=float) 
    for p in p_data_set: 
        temp = np.append(temp, euclidean(training_element, p)) 

    minIndex = np.argmin(temp) 
    return p_label_set[minIndex] 

def faster_prediction(training_element, p_data_set, p_label_set): 
    temp = np.tile(training_element, (p_data_set.shape[0], 1)) 
    temp = np.sqrt(np.sum((temp - p_data_set)**2, 1)) 

    minIndex = np.argmin(temp) 
    return p_label_set[minIndex] 


training_element = [1, 2, 3] 
p_data_set = np.random.rand(100000, 3)*10 
p_label_set = np.r_[0:p_data_set.shape[0]] 


t1 = time.time() 
result_1 = prediction(training_element, p_data_set, p_label_set) 
t2 = time.time() 

t3 = time.time() 
result_2 = faster_prediction(training_element, p_data_set, p_label_set) 
t4 = time.time() 


print("Execution time 1:", t2 - t1, "value:", result_1) 
print("Execution time 2:", t4 - t3, "value:", result_2) 
print("Speed up:", (t4 - t3)/(t2 - t1)) 

On my very old laptop I get the following results:

Execution time 1: 21.6033108234 value: 9819 
Execution time 2: 0.0176379680634 value: 9819 
Speed up: 1224.81857013 

Which makes me think I must have made some silly mistake :)
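
One possible simplification (my note, not part of the answer): the np.tile step can be dropped entirely, since NumPy broadcasting subtracts a (d,) vector from each row of an (n, d) array directly and saves one temporary copy. A sketch equivalent to faster_prediction:

def broadcast_prediction(training_element, p_data_set, p_label_set): 
    # (n, d) - (d,) broadcasts row-wise; no tiling needed 
    temp = np.sqrt(np.sum((p_data_set - training_element)**2, axis=1)) 
    return p_label_set[np.argmin(temp)] 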

For very large amounts of data, where memory may become an issue, I suggest using Cython or implementing the function in C++ and wrapping it in Python.
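
As a lighter-weight alternative to Cython/C++ (a sketch of a different mitigation, not what the answer proposes), the data set can also be processed in chunks so that only one block of distances is held in memory at a time:

def chunked_prediction(training_element, p_data_set, p_label_set, chunk=10000): 
    # keep only the best (smallest squared distance, its global index) seen so far 
    best_idx, best_dist = 0, np.inf 
    for start in range(0, p_data_set.shape[0], chunk): 
        block = p_data_set[start:start + chunk] 
        d = np.sum((block - training_element)**2, axis=1)  # squared distances suffice for argmin 
        i = np.argmin(d) 
        if d[i] < best_dist: 
            best_idx, best_dist = start + i, d[i] 
    return p_label_set[best_idx] 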


If multiple searches need to be made over the same data set, then I would recommend a KDTree, as ali_m described. – GpG