
How can I optimize this code? It currently runs slowly because of the amount of data it has to loop through. The code performs a 1-nearest-neighbour prediction: it predicts the label for training_element based on p_data_set. How can this NN prediction be optimized?

import numpy as np 
from scipy.spatial import distance 

# training_element: [x], p_data_set: [[x1], [x2], [x3]], p_label_set: [l1, l2, l3] 
def prediction(training_element, p_data_set, p_label_set): 
    temp = np.array([], dtype=float) 
    for p in p_data_set: 
        temp = np.append(temp, distance.euclidean(training_element, p)) 

    minIndex = np.argmin(temp) 
    return p_label_set[minIndex] 

What are the shapes of the inputs involved? – Divakar


(100L,), (40000, 100L), (40000,) – user

Answers


Use a k-D tree for fast nearest-neighbour lookups, e.g. scipy.spatial.cKDTree:

from scipy.spatial import cKDTree 

# I assume that p_data_set is (nsamples, ndims) 
tree = cKDTree(p_data_set) 

# training_elements is also assumed to be (nsamples, ndims) 
dist, idx = tree.query(training_elements, k=1) 

predicted_labels = p_label_set[idx] 
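
Note that tree.query also accepts a single point: for the question's shapes, where training_element is one (100,) vector, it returns a scalar dist and idx. A minimal sketch under that assumption (the toy arrays below are made up for illustration):

import numpy as np 
from scipy.spatial import cKDTree 

# hypothetical arrays matching the question's shapes: (40000, 100) data, (40000,) labels 
p_data_set = np.random.rand(40000, 100) 
p_label_set = np.random.randint(0, 5, 40000) 
training_element = np.random.rand(100) 

tree = cKDTree(p_data_set) 
dist, idx = tree.query(training_element, k=1)  # scalar distance and index for a single query point 
predicted_label = p_label_set[idx] 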

Someday I have to learn this cKDTree! – Divakar


You can get the distances temp directly with distance.cdist and then use .argmin() to get the index of the minimum, like so -

minIndex = distance.cdist(training_element[None],p_data_set).argmin() 
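
For completeness, here is how that one-liner would slot into the original prediction function; a minimal sketch assuming a (d,) NumPy array for training_element and an (n, d) p_data_set (the function name prediction_cdist is made up):

from scipy.spatial import distance 

def prediction_cdist(training_element, p_data_set, p_label_set): 
    # [None] adds a leading axis, turning (d,) into the (1, d) shape cdist expects 
    minIndex = distance.cdist(training_element[None], p_data_set).argmin() 
    return p_label_set[minIndex] 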

Here's another approach using np.einsum -

subs = p_data_set - training_element 
minIndex = np.einsum('ij,ij->i',subs,subs).argmin() 
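
The 'ij,ij->i' subscripts multiply subs elementwise with itself and sum over the last axis, i.e. they compute the row-wise squared Euclidean distances; the sqrt can be skipped because argmin is unaffected by a monotonic transform. A quick sanity check on made-up data:

import numpy as np 

p_data_set = np.random.rand(1000, 100)  # hypothetical data 
training_element = np.random.rand(100) 

subs = p_data_set - training_element    # (n, d) - (d,) broadcasts row-wise 
sq_dists = np.einsum('ij,ij->i', subs, subs) 
assert np.allclose(sq_dists, np.linalg.norm(subs, axis=1)**2) 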

Runtime test

Well, I thought cKDTree would easily beat cdist, but with training_element being a 1D array, cdist doesn't look too heavy; instead I am seeing it beat cKDTree by a good 10x+ margin!

Here are the timing results -

In [422]: # Setup arrays 
    ...: p_data_set = np.random.randint(0,9,(40000,100)) 
    ...: training_element = np.random.randint(0,9,(100,)) 
    ...: 

In [423]: def tree_based(p_data_set,training_element): #@ali_m's soln 
    ...:  tree = cKDTree(p_data_set) 
    ...:  dist, idx = tree.query(training_element, k=1) 
    ...:  return idx 
    ...: 
    ...: def einsum_based(p_data_set,training_element):  
    ...:  subs = p_data_set - training_element 
    ...:  return np.einsum('ij,ij->i',subs,subs).argmin() 
    ...: 

In [424]: %timeit tree_based(p_data_set,training_element) 
1 loops, best of 3: 210 ms per loop 

In [425]: %timeit einsum_based(p_data_set,training_element) 
100 loops, best of 3: 17.3 ms per loop 

In [426]: %timeit distance.cdist(training_element[None],p_data_set).argmin() 
100 loops, best of 3: 14.8 ms per loop 

Python can be quite a fast programming language if used properly. Here is my suggestion (faster_prediction):

import numpy as np 
import time 

def euclidean(a, b): 
    return np.linalg.norm(a - b) 

def prediction(training_element, p_data_set, p_label_set): 
    temp = np.array([], dtype=float) 
    for p in p_data_set: 
        temp = np.append(temp, euclidean(training_element, p)) 

    minIndex = np.argmin(temp) 
    return p_label_set[minIndex] 

def faster_prediction(training_element, p_data_set, p_label_set): 
    temp = np.tile(training_element, (p_data_set.shape[0], 1)) 
    temp = np.sqrt(np.sum((temp - p_data_set)**2, 1)) 

    minIndex = np.argmin(temp) 
    return p_label_set[minIndex] 


training_element = [1, 2, 3] 
p_data_set = np.random.rand(100000, 3)*10 
p_label_set = np.r_[0:p_data_set.shape[0]] 


t1 = time.time() 
result_1 = prediction(training_element, p_data_set, p_label_set) 
t2 = time.time() 

t3 = time.time() 
result_2 = faster_prediction(training_element, p_data_set, p_label_set) 
t4 = time.time() 


print("Execution time 1:", t2 - t1, "value:", result_1) 
print("Execution time 2:", t4 - t3, "value:", result_2) 
print("Speed up:", (t4 - t3)/(t2 - t1)) 

On my very old laptop I get the following results:

Execution time 1: 21.6033108234 value: 9819 
Execution time 2: 0.0176379680634 value: 9819 
Speed up: 1224.81857013 

Which makes me think I must have made some silly mistake :)
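
One possible simplification (my note, not part of the answer): the np.tile step can be dropped entirely, since NumPy broadcasting subtracts a (d,) vector from each row of an (n, d) array directly and saves one temporary copy. A sketch equivalent to faster_prediction:

def broadcast_prediction(training_element, p_data_set, p_label_set): 
    # (n, d) - (d,) broadcasts row-wise; no tiling needed 
    temp = np.sqrt(np.sum((p_data_set - training_element)**2, axis=1)) 
    return p_label_set[np.argmin(temp)] 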

For very large amounts of data, where memory may become an issue, I suggest using Cython or implementing the function in C++ and wrapping it in Python.
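
As a lighter-weight alternative to Cython/C++ (a sketch of a different mitigation, not what the answer proposes), the data set can also be processed in chunks so that only one block of distances is held in memory at a time:

def chunked_prediction(training_element, p_data_set, p_label_set, chunk=10000): 
    # keep only the best (smallest squared distance, its global index) seen so far 
    best_idx, best_dist = 0, np.inf 
    for start in range(0, p_data_set.shape[0], chunk): 
        block = p_data_set[start:start + chunk] 
        d = np.sum((block - training_element)**2, axis=1)  # squared distances suffice for argmin 
        i = np.argmin(d) 
        if d[i] < best_dist: 
            best_idx, best_dist = start + i, d[i] 
    return p_label_set[best_idx] 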


If multiple searches need to be made over the same data set, then I would recommend a KDTree, as ali_m described. – GpG