
Numpy - Clustering - Distance - Vectorisation

I clustered a data sample (400k samples, dimension = 205, 200 clusters) with sklearn KMeans.

For each cluster, I would like to know the maximum distance between the cluster center and the sample of that cluster farthest from it, to get an idea of the cluster's "size". Here is my code:

import numpy as np 
import scipy.spatial.distance as spd 
diam = np.empty([200]) 
for i in range(200): 
    diam[i] = spd.cdist(seed[np.newaxis, i, 1:], data[data[:, 0]==i][:,1:]).max() 

"seed" contains the cluster centers (200x206). The first column of "seed" holds the number of samples in each cluster (not relevant here).

"data" contains the samples (400k x 206). The first column of "data" holds the cluster number.
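
For context, here is a minimal sketch of how arrays with this layout could be built from a fitted sklearn KMeans model (the names X and km are assumptions for illustration, not from the question):

import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=200).fit(X)   # X: (400000, 205) feature matrix, assumed name

# seed: col 0 = number of samples per cluster, cols 1: = cluster center coordinates
seed = np.column_stack([np.bincount(km.labels_), km.cluster_centers_])   # (200, 206)

# data: col 0 = cluster label of each sample, cols 1: = sample coordinates
# note: column_stack upcasts the labels to float; use data[:, 0].astype(int) where integer labels are needed
data = np.column_stack([km.labels_, X])                                  # (400000, 206)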

Question: this is done with a loop (not vectorized with numpy). Is it possible to vectorize it?


This is actually fairly reasonable code. Your for loop is cheap relative to the amount of computation done inside 'cdist'. Since 'cdist' is already fairly well optimized for speed, the gains will not be large. – Daniel


@Ophion - Still, the repeated linear search data[:, 0] == i can be avoided, going from O(n**2) to O(n log(n)) or even O(n). – moarningsun 2015-11-06 20:19:23


@moarningsun Yes, but what is possible and what is worthwhile are two different things, especially considering it is O(n*m) rather than O(n^2), with n << m. So far none of the solutions is faster than the OP's, and all of them have more memory overhead. – Daniel

Answers


Here's a vectorized approach -

# Sort data w.r.t. col-0 
data_sorted = data[data[:, 0].argsort()] 

# Get counts of unique tags in col-0 of data and repeat seed accordingly. 
# Thus, we would have an extended version of seed that matches data's shape. 
seed_ext = np.repeat(seed,np.bincount(data_sorted[:,0]),axis=0) 

# Get euclidean distances between extended seed version and sorted data 
dists = np.sqrt(((data_sorted[:,1:] - seed_ext[:,1:])**2).sum(1)) 

# Get positions of shifts in col-0 of sorted data 
shift_idx = np.append(0,np.nonzero(np.diff(data_sorted[:,0]))[0]+1) 

# Final piece of puzzle is to get tag based maximum values from dists, 
# where each tag is unique number in col-0 of data 
diam_out = np.maximum.reduceat(dists,shift_idx) 
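
As a small illustration of how np.maximum.reduceat picks the per-segment maxima at those shift positions (a toy sketch with made-up values, not part of the original answer):

import numpy as np

labels = np.array([0, 0, 1, 1, 1, 2])         # sorted col-0 tags
dists = np.array([3., 7., 2., 9., 4., 5.])    # distances in the same order

shift_idx = np.append(0, np.nonzero(np.diff(labels))[0] + 1)   # [0, 2, 5]
np.maximum.reduceat(dists, shift_idx)                          # array([ 7.,  9.,  5.])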

Runtime test and output verification -

Define the functions:

def loopy_cdist(seed, data): # from OP's solution
    N = seed.shape[0]
    diam = np.empty(N)
    for i in range(N):
        diam[i] = spd.cdist(seed[np.newaxis, i, 1:], data[data[:, 0]==i][:, 1:]).max()
    return diam

def vectorized_repeat_reduceat(seed,data): # from this solution 
    data_sorted = data[data[:, 0].argsort()] 
    seed_ext = np.repeat(seed,np.bincount(data_sorted[:,0]),axis=0) 
    dists = np.sqrt(((data_sorted[:,1:] - seed_ext[:,1:])**2).sum(1)) 
    shift_idx = np.append(0,np.nonzero(np.diff(data_sorted[:,0]))[0]+1) 
    return np.maximum.reduceat(dists,shift_idx) 

def vectorized_indexing_maxat(seed,data): # from @moarningsun's solution 
    seed_repeated = seed[data[:,0]] 
    dist_to_center = np.sqrt(np.sum((data[:,1:]-seed_repeated[:,1:])**2, axis=1)) 
    diam = np.zeros(len(seed)) 
    np.maximum.at(diam, data[:,0], dist_to_center) 
    return diam 

Verify the outputs:

In [417]: # Inputs 
    ...: seed = np.random.rand(20,20) 
    ...: data = np.random.randint(0,20,(40000,20)) 
    ...: 

In [418]: np.allclose(loopy_cdist(seed,data),vectorized_repeat_reduceat(seed,data)) 
Out[418]: True 

In [419]: np.allclose(loopy_cdist(seed,data),vectorized_indexing_maxat(seed,data)) 
Out[419]: True 

Runtimes:

In [420]: %timeit loopy_cdist(seed,data) 
10 loops, best of 3: 35.9 ms per loop 

In [421]: %timeit vectorized_repeat_reduceat(seed,data) 
10 loops, best of 3: 28.9 ms per loop 

In [422]: %timeit vectorized_indexing_maxat(seed,data) 
10 loops, best of 3: 24.1 ms per loop 

Very similar idea to @Divakar, but without having to sort:

seed_repeated = seed[data[:,0]] 
dist_to_center = np.sqrt(np.sum((data[:,1:]-seed_repeated[:,1:])**2, axis=1)) 

diam = np.zeros(len(seed)) 
np.maximum.at(diam, data[:,0], dist_to_center) 

ufunc.at is known to be slow, so it will be interesting to see which one is faster.
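
For reference, a tiny sketch of what np.maximum.at does here (toy values assumed purely for illustration):

import numpy as np

labels = np.array([0, 2, 0, 1, 2])       # cluster id per sample (no sorting needed)
dists = np.array([3., 5., 7., 2., 1.])   # distance of each sample to its center

diam = np.zeros(3)
np.maximum.at(diam, labels, dists)       # unbuffered in-place maximum per label
# diam is now array([ 7.,  2.,  5.])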


Smart move, taking care of the "repeat" with indexing! – Divakar


We can be a bit smarter with the indexing and save a factor of about 4 in cost.

First let's build some data of the right shape:

seed = np.random.randint(0, 100, (200, 206))
data = np.random.randint(0, 100, (400000, 206))
seed[:, 0] = np.arange(200)
data[:, 0] = np.random.randint(0, 200, 400000)
diam = np.empty(200) 

Timing of the original answer:

%%timeit 
for i in range(200): 
    diam[i] = spd.cdist(seed[np.newaxis, i, 1:], data[data[:, 0]==i][:,1:]).max() 

1 loops, best of 3: 1.35 s per loop 

moarningsun's answer:

%%timeit 
seed_repeated = seed[data[:,0]] 
dist_to_center = np.sqrt(np.sum((data[:,1:]-seed_repeated[:,1:])**2, axis=1)) 
diam = np.zeros(len(seed)) 
np.maximum.at(diam, data[:,0], dist_to_center) 

1 loops, best of 3: 1.33 s per loop 

Divakar's answer:

%%timeit 
data_sorted = data[data[:, 0].argsort()] 
seed_ext = np.repeat(seed,np.bincount(data_sorted[:,0]),axis=0) 
dists = np.sqrt(((data_sorted[:,1:] - seed_ext[:,1:])**2).sum(1)) 
shift_idx = np.append(0,np.nonzero(np.diff(data_sorted[:,0]))[0]+1) 
diam_out = np.maximum.reduceat(dists,shift_idx) 

1 loops, best of 3: 1.65 s per loop 

As we can see, aside from a larger memory footprint, none of the vectorized solutions really gains anything. To avoid that, we need to go back to the original answer, which is really the right way to do this kind of thing, and instead try to reduce the amount of indexing:

%%timeit 
idx = data[:,0].argsort() 
bins = np.bincount(data[:,0]) 
counter = 0 
for i in range(200): 
    data_slice = idx[counter: counter+bins[i]] 
    diam[i] = spd.cdist(seed[None, i, 1:], data[data_slice, 1:]).max() 
    counter += bins[i] 

1 loops, best of 3: 281 ms per loop 

Double checking the answers:

np.allclose(diam, diam_out)
True 

This question assumes python loops are bad. They often are, but not in all cases.