As others have said, hierarchical clustering needs to compute the pairwise distance matrix, which in your case is too large to fit in memory.
Try the k-means algorithm instead:
numClusters = 4;
T = kmeans(X, numClusters);
Alternatively, you can pick a random subset of your data and feed it to the clustering algorithm. Next, compute each cluster center as the mean/median of the instances in that cluster. Finally, for every instance that was not selected in the subset, simply compute its distance to each centroid and assign it to the closest one.
Here is some sample code illustrating the idea above:
%# random data
X = rand(25000, 2);
%# pick a subset
SUBSET_SIZE = 1000; %# subset size
ind = randperm(size(X,1));
data = X(ind(1:SUBSET_SIZE), :);
%# cluster the subset data
D = pdist(data, 'euclidean');
T = linkage(D, 'ward');
CUTOFF = 0.6*max(T(:,3)); %# CUTOFF = 5;
C = cluster(T, 'criterion','distance', 'cutoff',CUTOFF);
K = length(unique(C)); %# number of clusters found
%# visualize the hierarchy of clusters
figure(1)
h = dendrogram(T, 0, 'colorthreshold',CUTOFF);
set(h, 'LineWidth',2)
set(gca, 'XTickLabel',[], 'XTick',[])
%# plot the subset data colored by clusters
figure(2)
subplot(121), gscatter(data(:,1), data(:,2), C), axis tight
%# compute cluster centers
centers = zeros(K, size(data,2));
for i=1:size(data,2)
centers(:,i) = accumarray(C, data(:,i), [], @mean);
end
%# calculate distance of each instance to all cluster centers
D = zeros(size(X,1), K);
for k=1:K
D(:,k) = sum(bsxfun(@minus, X, centers(k,:)).^2, 2);
end
%# assign each instance to the closest cluster
[~,clustIDX] = min(D, [], 2);
%#clustIDX(ind(1:SUBSET_SIZE)) = C;
%# plot the entire data colored by clusters
subplot(122), gscatter(X(:,1), X(:,2), clustIDX), axis tight
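As a side note, if the Statistics Toolbox function PDIST2 is available in your MATLAB version, the explicit distance loop and the MIN step above can be condensed into a one-liner (a sketch, assuming `X` and `centers` are defined as in the code above):

```matlab
%# distances from all instances to all centers, then closest-center assignment
[~, clustIDX] = min(pdist2(X, centers, 'euclidean'), [], 2);
```

This produces the same assignments as the loop version, since minimizing the Euclidean distance is equivalent to minimizing the squared distance used above.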
Nice solution, I like it. – Donnie 2010-05-31 22:47:48
Thanks for the comprehensive answer. The reason I am using hierarchical clustering is that I don't know in advance how many clusters I need. With k-means I have to define that from the start, and given the nature of my project it is not possible for me to use k-means. Thanks anyway... – Hossein 2010-05-31 22:49:48
@Hossein:我改變了代碼,使用'cutoff'值來查找沒有事先指定它的最佳數目的簇... – Amro 2010-05-31 23:09:50