I'm experimenting with clustering and I'm surprised at how slow it seems. I generated a random graph with 30 communities, each containing 30 nodes. Nodes in the same community have a 90% chance of being connected, and nodes in different communities have a 10% chance of being connected. I'm measuring the similarity between two nodes as the Jaccard similarity of their sets of neighbors. Clustering with DBSCAN is surprisingly slow.
This toy example spends about 15 seconds on the dbscan part alone, and that time grows very quickly if I increase the number of nodes. Since there are only 900 nodes in total, this seems very slow.
from __future__ import division
import numpy as np
from sklearn.cluster import dbscan
import networkx as nx
import matplotlib.pyplot as plt
import time

# Define the Jaccard distance. Follows the example for clustering with
# Levenshtein distance from http://scikit-learn.org/stable/faq.html
def jaccard_distance(x, y):
    return 1 - len(neighbors[x].intersection(neighbors[y])) / len(neighbors[x].union(neighbors[y]))

def jaccard_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return jaccard_distance(i, j)

# Simulate a planted partition graph, the simplest form of community
# detection benchmark.
num_communities = 30
size_of_communities = 30
print "planted partition"
G = nx.planted_partition_graph(num_communities, size_of_communities, 0.9, 0.1, seed=42)

# Make a hash table of sets of neighbors for each node.
neighbors = {}
for n in G:
    for nbr in G[n]:
        if n not in neighbors:
            neighbors[n] = set()
        neighbors[n].add(nbr)
print "Made data"

X = np.arange(len(G)).reshape(-1, 1)
t = time.time()
db = dbscan(X, metric=jaccard_metric, eps=0.85, min_samples=2)
print db
print "Clustering took ", time.time() - t, "seconds"
How can I make this scale better to larger numbers of nodes?
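One direction (a sketch of an approach, not code from the question): the per-pair Python callback is the likely bottleneck, since scikit-learn calls `jaccard_metric` once for every pair of points. Assuming the same planted-partition setup, the full Jaccard distance matrix can instead be computed in one shot from the 0/1 adjacency matrix (intersection sizes via a matrix product, union sizes by inclusion-exclusion) and passed to `dbscan` with `metric='precomputed'`:

```python
import numpy as np
import networkx as nx
from sklearn.cluster import dbscan

G = nx.planted_partition_graph(30, 30, 0.9, 0.1, seed=42)

# 0/1 adjacency matrix; row i is the indicator vector of node i's neighbor set.
A = nx.to_numpy_array(G)
degrees = A.sum(axis=1)

# |N(i) & N(j)| for all pairs at once, then |N(i) | N(j)| by inclusion-exclusion:
# |union| = deg(i) + deg(j) - |intersection|.
intersections = A @ A.T
unions = degrees[:, None] + degrees[None, :] - intersections
D = 1.0 - intersections / unions

# DBSCAN on the precomputed distance matrix, same eps/min_samples as above.
core_samples, labels = dbscan(D, metric='precomputed', eps=0.85, min_samples=2)
```

The dense `D` costs O(n²) memory, so this trades space for the time saved by avoiding 900² Python-level metric calls.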
This is a really great answer. Thank you! – eleanora
Depending on how many nodes you want to scale to, more tricks may be needed. For example, the memory for storing `D` could become a problem; sparse matrices (which are accepted by [DBSCAN](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)) might help there. –
Sounds good. I'll definitely look into it, since I want to work with larger graphs. – eleanora