See @agartland's answer: you can specify n_jobs in sklearn.metrics.pairwise.pairwise_distances, or look for a clustering algorithm in sklearn.cluster that accepts an n_jobs parameter, for example sklearn.cluster.KMeans.
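For instance, a minimal sketch of that built-in route (the data shape, metric, and n_jobs value are placeholders, not part of the original answer):

import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

data = np.random.rand(100, 10)  # placeholder (n_samples, n_features) array
# n_jobs=-1 uses all available cores; the work is split across processes
D = pairwise_distances(data, metric="euclidean", n_jobs=-1)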
However, if you are feeling adventurous, you can implement the computation yourself. For example, if you need a 1D condensed distance matrix for scipy.cluster.hierarchy.linkage, you can use:
#!/usr/bin/env python3
from multiprocessing import Pool

import numpy as np
from time import time as ts

data = np.zeros((100, 10))  # YOUR data: np.array[n_samples x m_features]
n_processes = 4             # YOUR number of processors

def metric(a, b):           # YOUR distance function
    return np.sum(np.abs(a - b))

n = data.shape[0]
k_max = n * (n - 1) // 2    # number of elements in the 1D condensed dist array
k_step = n ** 2 // 500      # split the work into ~500 chunks
dist = np.zeros(k_max)      # resulting 1D dist array

def proc(start):
    dist = []
    k1 = start
    k2 = min(start + k_step, k_max)
    for k in range(k1, k2):
        # recover (i, j) in the 2D distance matrix from index k of the 1D condensed matrix
        i = int(n - 2 - int(np.sqrt(-8 * k + 4 * n * (n - 1) - 7) / 2.0 - 0.5))
        j = int(k + i + 1 - n * (n - 1) / 2 + (n - i) * ((n - i) - 1) / 2)
        # compute and store the distance
        a = data[i, :]
        b = data[j, :]
        d = metric(a, b)
        dist.append(d)
    return k1, k2, dist

ts_start = ts()
with Pool(n_processes) as pool:
    for k1, k2, res in pool.imap_unordered(proc, range(0, k_max, k_step)):
        dist[k1:k2] = res
        print("{:.0f} minutes, {:,}..{:,} out of {:,}".format(
            (ts() - ts_start) / 60, k1, k2, k_max))
print("Elapsed %.0f minutes" % ((ts() - ts_start) / 60))
print("Saving...")
np.savez("dist.npz", dist=dist)
print("DONE")
Just so you know, scipy.cluster.hierarchy.linkage runs serially, and its complexity is at least O(N*N). I am not sure whether scipy has a parallel implementation of this function.
Isn't this done by scipy.spatial.distance.cdist(XA, XB, 'cosine')? – TJD

It does, actually, but are those methods parallel? I am currently using 'pdist', but it takes a long time. – dkar

Not parallelized, but probably much faster, since more of the work happens in native C code rather than in Python. – TJD
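For reference, a minimal sketch of the pdist approach mentioned in these comments (single-threaded, but the inner loop runs in native C):

from scipy.spatial.distance import pdist

d = pdist(data, metric="cosine")  # returns the same condensed 1D layout used above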