2013-10-29 30 views
1

我需要使用numexpr重寫此代碼,它計算矩陣數據[行x列]和向量[1 x列]的歐幾里得範數矩陣。歐幾里得範數使用numexpr

d = ((data-vec)**2).sum(axis=1) 

該怎麼辦?也許還有另一種更快的方法?

我使用hdf5和數據矩陣來源於它的問題。 例如,此代碼給出錯誤:對象未對齊。

#naive numpy solution, can be parallel? 
def test_bruteforce_knn(): 
    h5f = tables.open_file(fileName) 

    t0= time.time() 
    d = np.empty((rows*batches,)) 
    for i in range(batches): 
     d[i*rows:(i+1)*rows] = ((h5f.root.carray[i*rows:(i+1)*rows]-vec)**2).sum(axis=1) 
    print (time.time()-t0) 
    ndx = d.argsort() 
    print ndx[:k] 

    h5f.close() 

#using some tricks (don't work error: objects are not aligned) 
def test_bruteforce_knn(): 
    h5f = tables.open_file(fileName) 

    t0= time.time() 
    d = np.empty((rows*batches,)) 
    for i in range(batches): 
     d[i*rows:(i+1)*rows] = (np.einsum('ij,ij->i', h5f.root.carray[i*rows:(i+1)*rows], 
     h5f.root.carray[i*rows:(i+1)*rows]) 
     + np.dot(vec, vec) 
     -2 * np.dot(h5f.root.carray[i*rows:(i+1)*rows], vec)) 
    print (time.time()-t0) 
    ndx = d.argsort() 
    print ndx[:k] 

    h5f.close() 

使用numexpr:似乎numexpr不明白h5f.root.carray [我*行:第(i + 1)*行]必須重新分配?

import numexpr as ne 

def test_bruteforce_knn(): 
    h5f = tables.open_file(fileName) 

    t0= time.time() 
    d = np.empty((rows*batches,)) 
    for i in range(batches): 
     ne.evaluate("sum((h5f.root.carray[i*rows:(i+1)*rows] - vec) ** 2, axis=1)") 
    print (time.time()-t0) 
    ndx = d.argsort() 
    print ndx[:k] 

    h5f.close() 

回答

4

有隻使用NumPy的,這是在使用一個潛在的快速方法(對於非常大的陣列)scikit學習:

def squared_row_norms(X): 
    # From http://stackoverflow.com/q/19094441/166749 
    return np.einsum('ij,ij->i', X, X) 

def squared_euclidean_distances(data, vec): 
    data2 = squared_row_norms(data) 
    vec2 = squared_row_norms(vec) 
    d = np.dot(data, vec.T).ravel() 
    d *= -2 
    d += data2 
    d += vec2 
    return d 

這是基於這樣的事實,(X - Y)² = x 2 + y 2 - 2xy,即使對於矢量。

測試:

>>> data = np.random.randn(10, 40) 
>>> vec = np.random.randn(1, 40) 
>>> ((data - vec) ** 2).sum(axis=1) 
array([ 96.75712686, 69.45894306, 100.71998244, 80.97797154, 
     84.7 , 82.28910021, 67.48309433, 81.94813371, 
     64.68162331, 77.43265692]) 
>>> squared_euclidean_distances(data, vec) 
array([ 96.75712686, 69.45894306, 100.71998244, 80.97797154, 
     84.7 , 82.28910021, 67.48309433, 81.94813371, 
     64.68162331, 77.43265692]) 
>>> from sklearn.metrics.pairwise import euclidean_distances 
>>> euclidean_distances(data, vec, squared=True).ravel() 
array([ 96.75712686, 69.45894306, 100.71998244, 80.97797154, 
     84.7 , 82.28910021, 67.48309433, 81.94813371, 
     64.68162331, 77.43265692]) 

簡介:

>>> data = np.random.randn(1000, 40) 
>>> vec = np.random.randn(1, 40) 
>>> %timeit ((data - vec)**2).sum(axis=1) 
10000 loops, best of 3: 114 us per loop 
>>> %timeit squared_euclidean_distances(data, vec) 
10000 loops, best of 3: 52.5 us per loop 

使用numexpr也是可能的,但它似乎並沒有給予任何加速的1000點(在10000,這是不更好):

>>> %timeit ne.evaluate("sum((data - vec) ** 2, axis=1)") 
10000 loops, best of 3: 142 us per loop 
+0

謝謝,squared_euclidean_distances肯定工作更快,也看看我的更新,不能管理numexpr版本到工作。 – mrgloom