Selecting the most dissimilar individuals using cluster analysis


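I want to cluster my data into 5 clusters and then select, from all the data, the 50 individuals that are most dissimilar to the rest. The picks should be proportional to cluster size: if the first cluster contains 100 individuals, the second 200, the third 400, the fourth 200 and the fifth 100, I have to select 5 from the first cluster + 10 from the second + 20 from the third + 10 from the fourth + 5 from the fifth.

Example data: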

 mydata<-matrix(nrow=100,ncol=10,rnorm(1000, mean = 0, sd = 1))

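What I have done so far is cluster the data and rank the individuals within each cluster, then export the result to Excel and continue from there. This has become a problem now that my data has grown very large.

I would appreciate any help or suggestions on how to do the above in R.

Source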

2013-10-07 hema

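Do you need help w/ the R *commands* to get this to work, or w/ *understanding* the procedure that will need to be used? This sounds like a conceptual question about statistics rather than a programming question about R. If so, this Q would be better migrated to [Cross Validated](http://stats.stackexchange.com/) (i.e., stats.SE). – gung

Statistically it is very clear; I need help with how to do this in R. – hema

What R code do you have so far? –

I am not sure whether this is what you are looking for, but maybe it helps: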

mydata<-matrix(nrow=100, ncol=10, rnorm(1000, mean = 0, sd = 1)) 
rownames(mydata) <- paste0("id", 1:100) # some id for identification 


# calculate the dissimilarity (distance) matrix, then cut the tree into 5 clusters
sim <- dist(mydata)
cl <- cutree(hclust(sim), k = 5)

# combine results; the row sum aggregates each object's dissimilarity to all others
res <- data.frame(id = rownames(mydata),
                  cluster = cl, dis_sim = rowSums(as.matrix(sim)))
# order ascending, so the lowest overall dissimilarity comes first
res <- res[order(res$dis_sim), ]


# split object 
reslist <- split(res, f=res$cluster) 


## takes the last three rows of each cluster, i.e. the three items with the highest overall dissim.
lapply(reslist, tail, n = 3)

## returns the ids with the highest overall dissimilarity, top 20% of each cluster
lapply(reslist, function(x, p) tail(x, round(nrow(x) * p)), p = 0.2)
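The question asks for a fixed total (50) split across the clusters in proportion to their size, rather than a fixed count or a fixed share per cluster. A minimal sketch of that allocation, reusing the same row-sum ranking; the names `n_total`, `quota` and `picked` are illustrative and not from the thread, and the 1000-row data set is an assumption made only so that 50 picks are meaningful:

```r
set.seed(1)                                   # reproducible example data (assumption)
mydata <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)
rownames(mydata) <- paste0("id", 1:1000)

# same ranking as above: overall dissimilarity = row sum of the distance matrix
sim <- dist(mydata)
cl  <- cutree(hclust(sim), k = 5)
res <- data.frame(id = rownames(mydata), cluster = cl,
                  dis_sim = rowSums(as.matrix(sim)))
res <- res[order(res$dis_sim), ]              # ascending: most dissimilar rows last
reslist <- split(res, f = res$cluster)

# number of picks per cluster, proportional to cluster size
n_total <- 50
sizes <- sapply(reslist, nrow)
quota <- round(n_total * sizes / sum(sizes))

# take the last `quota` rows (highest dissimilarity) from each cluster
picked <- do.call(rbind, Map(function(x, n) tail(x, n), reslist, quota))
```

With cluster sizes 100, 200, 400, 200 and 100 this gives quotas of 5, 10, 20, 10 and 5. For other sizes the rounded quotas may not add up to exactly `n_total`, so the total may need a manual adjustment by one or two picks.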

Source

2013-10-07 14:27:34 holzben

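Dear Holzben, that really helped, thank you. One more thing: within a cluster, how do I pick the individuals closest to the cluster centroid? Thanks again for your nice code and reply. – hema

Regarding your comment, see the code below.

Please note that the code could be improved in terms of elegance and efficiency. Also, I posted this as a second answer because otherwise it would have become messy.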

# calculation of centroids based on:
# https://stat.ethz.ch/pipermail/r-help/2006-May/105328.html 
cl <- hclust(dist(mydata, diag = TRUE, upper=TRUE)) 
cent <- tapply(mydata, 
     list(rep(cutree(cl, 5), ncol(mydata)), col(mydata)), mean) 
dimnames(cent) <- list(NULL, dimnames(mydata)[[2]]) 


# add up cluster number and data and split by cluster 
newdf <- data.frame(data=mydata, cluster=cutree(cl, k=5)) 
newdfl <- split(newdf, f=newdf$cluster) 

# add each centroid as the first row and drop the cluster column (column 11)
totaldf <- lapply(1:5, 
      function(i, li, cen) rbind(cen[i, ], li[[i]][ , -11]), 
           li=newdfl, cen=cent) 


# calculate the distances to the centroids (row 1 of each block) and sort them
dist_to_cent <- lapply(totaldf, function(x) 
        sort(as.matrix(dist(x, diag=TRUE, upper=TRUE))[1, ])) 
dist_to_cent

For calculating the centroids from an hclust result, see the R mailing list (R-help).

Source
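A more compact variant of the same idea is sketched below: it computes each cluster's centroid with `colMeans` and measures every row's Euclidean distance to the centroid of its own cluster, without binding the centroid to the data. The names `dist_to_own_centroid` and `closest` are illustrative, and Euclidean distance is assumed because it is the default of `dist` used above:

```r
set.seed(1)                                   # reproducible example data (assumption)
mydata <- matrix(rnorm(1000), nrow = 100, ncol = 10)
rownames(mydata) <- paste0("id", 1:100)

cl <- cutree(hclust(dist(mydata)), k = 5)

# Euclidean distance of every row to the centroid of its own cluster
dist_to_own_centroid <- numeric(nrow(mydata))
for (k in unique(cl)) {
  members  <- which(cl == k)
  block    <- mydata[members, , drop = FALSE]
  centroid <- colMeans(block)
  diffs    <- sweep(block, 2, centroid)       # subtract the centroid from each row
  dist_to_own_centroid[members] <- sqrt(rowSums(diffs^2))
}

res <- data.frame(id = rownames(mydata), cluster = cl,
                  dist_cent = dist_to_own_centroid)

# the single closest individual in each cluster
closest <- do.call(rbind, lapply(split(res, res$cluster),
                                 function(x) x[which.min(x$dist_cent), ]))
```

Sorting `res` by `cluster` and `dist_cent` instead of taking `which.min` would give the full ranking within each cluster.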

2013-10-07 19:10:36 holzben

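Thank you for your time. Based on the data example, I think it might be better to use kmeans and cluster the data into 50 clusters (since I want to select 50 individuals), and then select from each cluster the one individual that is closest to the cluster centre. What do you think? Sorry to bother you with so many questions. – hema

If you are interested in analysing centroids, kmeans is obviously a more natural choice than hierarchical clustering... In my example I started with hierarchical clustering, so I used it in the second example as well. Your suggestion sounds good, but I am not sure what your overall goal is.... – holzben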
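The kmeans suggestion from the comment above could look roughly like the following sketch. It assumes a data set with clearly more than 50 rows (1000 here, an assumption), and the values for `nstart` and `iter.max` as well as the names `rep_idx` and `selected` are illustrative choices, not taken from the thread:

```r
set.seed(1)                                   # kmeans starts from random centres
mydata <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)
rownames(mydata) <- paste0("id", 1:1000)

# one cluster per individual to be selected
km <- kmeans(mydata, centers = 50, nstart = 5, iter.max = 50)

# Euclidean distance of each row to the centre of its assigned cluster
d <- sqrt(rowSums((mydata - km$centers[km$cluster, ])^2))

# one representative per cluster: the row nearest to the cluster centre
rep_idx <- sapply(split(seq_len(nrow(mydata)), km$cluster),
                  function(idx) idx[which.min(d[idx])])
selected <- rownames(mydata)[rep_idx]
```

Note that this picks the most central individual of each of 50 small clusters, which spreads the selection across the data but is a different criterion from the row-sum dissimilarity ranking in the first answer.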
