2017-04-04 244 views
1

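Python: Compare elements of two arrays

I want to compare the elements of two numpy arrays and delete elements from one of them if the Euclidean distance between the coordinates is smaller than 1 and the time is the same. data_CD4 and data_CD8 are the arrays. Each element is a list of 3D coordinates with the time as the fourth element (numpy.array([[x,y,z,time],[x,y,z,time] .....])). co is the cutoff, here 1.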

for i in data_CD8:
    for m in data_CD4:
        if distance.euclidean(tuple(i[:3]),tuple(m[:3])) < co and i[3]==m[3] :
            data_CD8=np.delete(data_CD8, i, 0)

Is there a faster way to do this? The first array has 5000 elements and the second 2000, so it takes too much time.

+0

That should be `[:3]`, not `[3:]`. – trincot

+0

If you want, you can also use numpy for the comparison; see: http://stackoverflow.com/questions/10580676/comparing-two-numpy-arrays-for-equality-element-wise – LethalProgrammer

+0

As @trincot pointed out, it has to be `distance.euclidean(tuple(i[:3]),tuple(m[:3]))`. Can you confirm? – Divakar

Answers

2

Here is a vectorized approach using SciPy's cdist:

from scipy.spatial import distance 

# Get Euclidean distances between the first three columns of data_CD8 and data_CD4
dists = distance.cdist(data_CD8[:,:3], data_CD4[:,:3]) 

# Get mask of those distances that are within co distance. This sets up the 
# first condition requirement as posted in the loopy version of original code. 
mask1 = dists < co 

# Take the fourth column (index 3) of the two input arrays, which holds the time values.
# Compare every time value of data_CD8 against every time value of data_CD4
# for equality. This sets up the second conditional requirement.
# We are adding a new axis with None, so that NumPy broadcasting
# lets us do these comparisons in a vectorized manner.
mask2 = data_CD8[:,3,None] == data_CD4[:,3] 

# Combine those two masks and look for any match corresponding to any
# element of data_CD4. Since the masks are set up such that the second axis
# represents data_CD4, we need numpy.any along axis=1 on the combined mask.
# A final inversion of the mask is needed as we are deleting the rows that
# satisfy these requirements.
mask3 = ~((mask1 & mask2).any(1)) 

# Finally, use boolean indexing to select the valid rows of data_CD8
data_CD8_out = data_CD8[mask3]
+0

Hmm, when I try your code, nothing gets deleted from the array. With my code, half of the elements in data_CD8 are deleted. Right now I can't say why. – Varlor

+0

@Varlor It creates a new array with the deletions as `data_CD8_out`. Did you verify the values in that array? Or simply assign it back with `data_CD8 = data_CD8[~((mask1 & mask2).any(1))]`? – Divakar

+0

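So data_CD8_out is the original array without the elements that satisfy the condition? Could you explain your code? It seems very fast and I would like to understand it :) – Varlor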
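
A small sketch on toy data may help with the explanation asked for here. The array values and the cutoff are made up for illustration; only `cdist`, broadcasting with `None`, and boolean indexing are taken from the answer.

```python
import numpy as np
from scipy.spatial import distance

# Toy inputs (hypothetical values): each row is [x, y, z, time]
data_CD8 = np.array([[0.0, 0.0, 0.0, 1.0],   # close to CD4 row 0, same time
                     [0.0, 0.0, 0.0, 2.0],   # close to CD4 row 0, different time
                     [5.0, 5.0, 5.0, 1.0]])  # far away from every CD4 row
data_CD4 = np.array([[0.5, 0.0, 0.0, 1.0],
                     [9.0, 9.0, 9.0, 2.0]])
co = 1.0

# Pairwise distances: one row per CD8 point, one column per CD4 point
dists = distance.cdist(data_CD8[:, :3], data_CD4[:, :3])   # shape (3, 2)
mask1 = dists < co

# (3, 1) compared with (2,) broadcasts to (3, 2)
mask2 = data_CD8[:, 3, None] == data_CD4[:, 3]

# A CD8 row is dropped if any CD4 row is both close and at the same time
drop = (mask1 & mask2).any(1)          # [True, False, False]
data_CD8_out = data_CD8[~drop]         # keeps the last two rows

print(drop)
print(data_CD8_out)
```

Only the first row is both within the cutoff and at the same time as some data_CD4 row, so it is the only one removed.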

0

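If you have to compare every item in data_CD4 against the items in data_CD8 while deleting data from data_CD8, it may be better to make the inner iterable smaller on each iteration. This of course depends on your most common case.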

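import numpy as np
from scipy.spatial import distance

for m in data_CD4:
    # Collect the indices of rows still in data_CD8 that match m, then delete them together
    hits = [idx for idx, i in enumerate(data_CD8)
            if distance.euclidean(i[:3], m[:3]) < co and i[3] == m[3]]
    data_CD8 = np.delete(data_CD8, hits, 0)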

Going by big-O notation, and since this is O(n^2), I don't see a faster solution.
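
For what it's worth, a spatial index is one way to avoid testing every pair. The following is only a sketch and not part of any answer here: it swaps in `scipy.spatial.cKDTree`, and the helper name `filter_cd8_kdtree` and the random test data are hypothetical. Since `query_ball_point` returns neighbours at distance `<= r`, the strict `<` check is applied again afterwards.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

def filter_cd8_kdtree(data_CD8, data_CD4, co):
    """Hypothetical helper: drop rows of data_CD8 that have a data_CD4 row
    closer than co with the same time value."""
    tree = cKDTree(data_CD4[:, :3])
    # For every CD8 point, indices of the CD4 points within distance <= co
    neighbours = tree.query_ball_point(data_CD8[:, :3], r=co)
    keep = np.ones(len(data_CD8), dtype=bool)
    for row, idx in enumerate(neighbours):
        if not idx:
            continue  # no CD4 point nearby, keep the row
        cand = data_CD4[idx]
        # Re-check with a strict "<" to match the original condition
        close = np.linalg.norm(cand[:, :3] - data_CD8[row, :3], axis=1) < co
        same_time = cand[:, 3] == data_CD8[row, 3]
        keep[row] = not np.any(close & same_time)
    return data_CD8[keep]

# Made-up data: random coordinates in a 10x10x10 box, three time steps
rng = np.random.default_rng(0)
cd8 = np.column_stack([rng.random((300, 3)) * 10, rng.integers(0, 3, 300)])
cd4 = np.column_stack([rng.random((200, 3)) * 10, rng.integers(0, 3, 200)])

# Brute-force reference using the cdist masks, for comparison
brute = cd8[~((cdist(cd8[:, :3], cd4[:, :3]) < 1.0)
              & (cd8[:, 3, None] == cd4[:, 3])).any(1)]
result = filter_cd8_kdtree(cd8, cd4, 1.0)
print(np.array_equal(result, brute))
```

Whether this beats the vectorized versions depends on the data; for a few thousand points the all-pairs masks may well be fast enough.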

2

This should be a vectorized approach.

mask1 = np.sum((data_CD4[:, None, :3] - data_CD8[None, :, :3])**2, axis = -1) < co**2 
mask2 = data_CD4[:, None, 3] == data_CD8[None, :, 3] 
mask3 = np.any(np.logical_and(mask1, mask2), axis = 0) 
data_CD8 = data_CD8[~mask3] 

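mask1 should speed up the distance calculation, since it needs no square-root call. mask1 and mask2 are 2D arrays that we squeeze down to 1D with np.any. Doing all the deletions at the end avoids a bunch of read/write operations.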
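
The shapes involved can be checked on a small random example. This is only a sketch: the array sizes, the seed, the cutoff, and the names `n4`, `n8`, `diff` and `sq_dists` are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n4, n8 = 4, 6
# Small integer-valued toy arrays, rows are [x, y, z, time]
data_CD4 = rng.integers(0, 3, (n4, 4)).astype(float)
data_CD8 = rng.integers(0, 3, (n8, 4)).astype(float)
co = 1.5

# Broadcasting (n4, 1, 3) against (1, n8, 3) gives all pairwise differences
diff = data_CD4[:, None, :3] - data_CD8[None, :, :3]    # shape (n4, n8, 3)
sq_dists = np.sum(diff ** 2, axis=-1)                    # shape (n4, n8)

# Comparing squared distances with co**2 avoids the square root
mask1 = sq_dists < co ** 2
mask2 = data_CD4[:, None, 3] == data_CD8[None, :, 3]    # shape (n4, n8)

# axis=0 collapses the data_CD4 axis, leaving one flag per data_CD8 row
mask3 = np.any(np.logical_and(mask1, mask2), axis=0)    # shape (n8,)
filtered = data_CD8[~mask3]

# The squared comparison selects the same pairs as cdist(...) < co
same = np.array_equal(mask1, cdist(data_CD4[:, :3], data_CD8[:, :3]) < co)
print(diff.shape, mask3.shape, same)
```

Note that the intermediate `diff` array has `n4 * n8 * 3` entries, so memory use grows with the product of the two array lengths.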

Speed test:

import numpy as np
from scipy.spatial.distance import cdist

a = np.random.randint(0, 10, (100, 3))

b = np.random.randint(0, 10, (100, 3))

%timeit cdist(a,b) < 5 #Divakar's answer 
10000 loops, best of 3: 133 µs per loop 

%timeit np.sum((a[None, :, :] - b[:, None, :]) ** 2, axis = -1) < 25 # My answer 
1000 loops, best of 3: 418 µs per loop 

And the C-compiled code wins, even with the unnecessary square root added.

+0

Thanks for your effort. When trying the code I get this error: IndexError: index 3284 is out of bounds for axis 0 with size 2587 – Varlor

+0

Hard to say what the error is, but try `axis = 0` in `mask3` –

+0

Aaand, just as in Divakar's answer, you need to invert `mask3` –