Clustering ocean wave data in R

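I have already clustered storm energy data in R using several clustering methods (kmeans, hclust, agnes, fanny). Although it is easy to pick the best method for my work, I need a computational (rather than theoretical) way to compare and evaluate the methods based on their results. Do you know of anything like that?

Thanks in advance,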

2016-08-29 Marz

I recall someone using the Dunn index to evaluate clustering algorithms. See http://artax.karlin.mff.cuni.cz/r-help/library/clValid/html/dunn.html
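To make the Dunn index suggestion concrete, here is a minimal sketch that computes it by hand for two methods at k = 5. The linked clValid package provides its own dunn() function; the helper `dunn_index` below is a hypothetical stand-in written for illustration so that only base R is needed, and USArrests is used as placeholder data rather than the storm energy data.

```r
data("USArrests")
DF <- scale(USArrests)   # standardise so no variable dominates the distances
d  <- dist(DF)           # Euclidean distances between observations
k  <- 5                  # fixed number of clusters

# Hypothetical helper (not from clValid): Dunn index =
# smallest between-cluster distance / largest within-cluster distance.
dunn_index <- function(labels, d) {
  m    <- as.matrix(d)
  same <- outer(labels, labels, "==")   # TRUE where two points share a cluster
  min(m[!same]) / max(m[same])
}

set.seed(42)
km_labels <- kmeans(DF, centers = k, nstart = 25)$cluster
hc_labels <- cutree(hclust(d, method = "average"), k = k)

# Higher values indicate more compact, better separated clusters
print(c(kmeans = dunn_index(km_labels, d),
        hclust = dunn_index(hc_labels, d)))
```

Any method that yields one integer label per observation can be scored the same way, which keeps the comparison independent of how the clusters were produced.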

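Hi. It might be better to ask your question on Cross Validated, the platform for machine learning and similar questions. If you are looking for an R package for clustering, try the caret package. caret wraps many different clustering methods from standard packages, which makes it easier to compare results. – PhiSeu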

Thanks for the suggestions, I will look into both! – Marz

Thanks to your question, I learned about the eclust function from the factoextra package.

Using the k-means demo from here:

# Load and scale the dataset 
data("USArrests") 
DF <- scale(USArrests) 

When the data is not scaled, the clustering results might not be reliable; see this [example](http://stats.stackexchange.com/questions/140711/why-does-gap-statistic-for-k-means-suggest-one-cluster-even-though-there-are-ob).
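A quick way to see why scaling matters here is to compare the column variances before and after scale(). This is only an illustrative check on USArrests, not part of the linked demo.

```r
# Compare column variances before and after standardising USArrests.
data("USArrests")

raw_var    <- sapply(USArrests, var)           # variance of each raw column
scaled_var <- apply(scale(USArrests), 2, var)  # variance after scale()

print(round(raw_var, 1))     # Assault has by far the largest raw variance
print(round(scaled_var, 1))  # every column has unit variance after scaling
```

On the raw scale, Euclidean distances are driven mostly by the high-variance column, so a distance-based clustering would largely reflect that one variable.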

library("factoextra") 

# Enhanced k-means clustering 
res.km <- eclust(DF, "kmeans") 


# Gap statistic plot 
fviz_gap_stat(res.km$gap_stat)

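This computes the optimal number of clusters using the eclust function.

Comparing clustering functions:

You can run all of the available methods and compute the optimal number of clusters for each with: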

clusterFuncList <- c("kmeans", "pam", "clara", "fanny", "hclust", "agnes", "diana")

# lapply + rbind gives one data frame row per clustering function
# (sapply would return a matrix of lists rather than the table shown below)
resultList <- do.call(rbind, lapply(clusterFuncList, function(x) {

  cat("Begin clustering for function:", x, "\n")

  # For each clustering function, find the optimal number of clusters;
  # graph = FALSE disables plotting
  clustObj <- eclust(DF, x, graph = FALSE)

  cat("End clustering for function:", x, "\n\n\n")

  # Return the optimal number of clusters for this clustering function
  data.frame(clustFunc = x, optimalNumbClusters = clustObj$nbclust, stringsAsFactors = FALSE)

}))

# > resultList
#   clustFunc optimalNumbClusters
# 1    kmeans                   4
# 2       pam                   4
# 3     clara                   5
# 4     fanny                   5
# 5    hclust                   4
# 6     agnes                   4
# 7     diana                   4

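Gap statistic as a goodness-of-fit measure:

The gap statistic is used as a measure of goodness of fit for clustering algorithms; see the paper

For a fixed, user-defined number of clusters, we can use the clusGap function from the cluster package to compare the gap statistic of each clustering algorithm: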

numbClusters <- 5

library(cluster)

clusterFuncFixedK <- c("kmeans", "pam", "clara", "fanny")

gapStatList <- do.call(rbind, lapply(clusterFuncFixedK, function(x) {

  cat("Begin clustering for function:", x, "\n")

  # Same seed for every function so the reference data sets are comparable
  set.seed(42)

  # For each clustering function, compute the gap statistic for k = 1..numbClusters
  gapStatBoot <- clusGap(DF, FUNcluster = get(x), K.max = numbClusters)

  gapStatVec <- round(gapStatBoot$Tab[, "gap"], 3)

  gapStat_at_AllClusters   <- paste(gapStatVec, collapse = ",")
  gapStat_at_chosenCluster <- gapStatVec[numbClusters]

  cat("End clustering for function:", x, "\n\n\n")

  # Return the gap statistics for this clustering function
  data.frame(clustFunc = x,
             gapStat_at_AllClusters = gapStat_at_AllClusters,
             gapStat_at_chosenCluster = gapStat_at_chosenCluster,
             stringsAsFactors = FALSE)

}))

# > gapStatList
#   clustFunc        gapStat_at_AllClusters gapStat_at_chosenCluster
# 1    kmeans  0.184,0.235,0.264,0.233,0.27                    0.270
# 2       pam 0.181,0.253,0.274,0.307,0.303                    0.303
# 3     clara 0.181,0.253,0.276,0.311,0.315                    0.315
# 4     fanny  0.181,0.23,0.313,0.351,0.478                    0.478

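The table above shows the gap statistic of each algorithm for k = 1 to 5 clusters. Column 3, gapStat_at_chosenCluster, holds the gap statistic at k = 5 clusters. A larger gap means the partition is tighter relative to the null reference distribution, so at k = 5 clusters fanny scores highest and kmeans lowest on the USArrests dataset.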
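To show where the gap values come from, the sketch below rebuilds the gap column from the other columns of the clusGap result and then applies maxSE from the cluster package to pick k. The choice of B = 50 reference data sets is an assumption made only to shorten the run.

```r
library(cluster)

data("USArrests")
DF <- scale(USArrests)

set.seed(42)
# B = 50 reference data sets keeps the run short; the clusGap default is 100
gap_km <- clusGap(DF, FUNcluster = kmeans, K.max = 5, B = 50)

tab <- gap_km$Tab   # columns: logW, E.logW, gap, SE.sim

# The gap is the expected log dispersion under the reference distribution
# minus the observed log dispersion
gap_by_hand <- tab[, "E.logW"] - tab[, "logW"]
print(all.equal(unname(gap_by_hand), unname(tab[, "gap"])))

# Choose k from the gap curve with the "first SE max" rule
k_hat <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "firstSEmax")
print(k_hat)
```

The same two steps work for pam, clara, or fanny by swapping the FUNcluster argument.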

2016-08-29 11:01:06 OdeToMyFiddle

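Thanks for your answer, but I think you have proposed another way of building the clusters (something I have already done). The number of clusters is also fixed: I need 5 clusters, so I am not looking for the optimal number. Unless it makes sense to run this code to find out which method works better for 5 clusters! – Marz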
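For the fixed-k case raised in this comment, one computational option is to score every partition with the same internal validity index. The sketch below uses the average silhouette width from the cluster package at k = 5. It runs on USArrests as placeholder data, uses pam in place of fanny to keep the example short (fanny's crisp labels are available as `$clustering`), and the linkage choices are assumptions.

```r
library(cluster)

data("USArrests")
DF <- scale(USArrests)
d  <- dist(DF)   # Euclidean distances, shared by all methods
k  <- 5          # the fixed number of clusters from the question

set.seed(42)
# One hard partition (integer labels) per method, all at the same k
labels <- list(
  kmeans = kmeans(DF, centers = k, nstart = 25)$cluster,
  pam    = pam(d, k = k)$clustering,
  hclust = cutree(hclust(d, method = "ward.D2"), k = k),
  agnes  = cutree(as.hclust(agnes(d, method = "ward")), k = k)
)

# Average silhouette width per method: values closer to 1 are better
avg_sil <- sapply(labels, function(cl) mean(silhouette(cl, d)[, "sil_width"]))

scores <- data.frame(method = names(avg_sil),
                     avg_silhouette = round(avg_sil, 3),
                     row.names = NULL)
print(scores[order(-scores$avg_silhouette), ])
```

Because every method is scored on the same distance matrix and the same k, the ranking reflects only how each algorithm partitions the data.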
