Optimizing trimmed k-means for clustering 2D data with many outliers? Is there a better method?

I have data of the following type (shown as a plot):

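Looking only at the individual data points, it is almost impossible to tell where the peaks should be. But when I plot them in ggplot with 2D density smoothing, I get these very nice peaks, from which I can visually count the ~10 groups of points I want to find. The exact number of "valid groups" is of course open to discussion.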

library(ggplot2) 
library(ggthemes)   # provides theme_tufte() 
library(colorRamps) # provides matlab.like() 
library(tclust) 

ggplot(data = df, aes(x = x, y = y)) + 
    stat_density2d(geom = "raster", 
        aes(fill = ..density..), 
        contour = FALSE) + 
    geom_point(col = "white", alpha = 0.1) + 
    scale_x_continuous(expand = c(0,0), 
         limits = c(0,1)) + 
    scale_y_continuous(expand = c(0,0), 
         limits = c(0,1)) + 
    theme_tufte(base_size = 11, base_family = "Helvetica") + 
    theme(axis.text = element_text(color = "black"), 
      panel.border = element_rect(colour = "black", fill=NA, size=0.7), 
      legend.key.height = unit(2.5,"line"), 
      legend.key.width = unit(1, "line")) + 
    scale_fill_gradientn(name = "Density", 
         colours = matlab.like(1000))

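I looked into trimmed clustering with the tclust package. By fiddling with it I was able to produce the result below. However, no matter how I adjust the parameters, I cannot get the groups to look as "tight" as I see them in the density plot. In particular, group 5 seems to creep into places where it does not belong. Group 10 is also a bit odd, but it is isolated enough to discard afterwards.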

Is there a better method, or is it just that I don't understand how to set the parameters properly?

set.seed(2) 

trimmed_cluster <- tclust(
    x = df, 
    k = 10, # 9 
    alpha = 0.1, # 0.1 
    drop.empty.clust = FALSE, 
    equal.weights = TRUE, 
    restr = c("sigma", "eigen"), # sigma 
    restr.fact = 1 
) 

df$cluster <- trimmed_cluster$cluster 

trimmed_cluster_centers <- data.frame(t(trimmed_cluster$centers)) 

df_clustered <- subset(df, cluster != 0) 

ggplot(data = df, aes(x = x, y = y)) + 
    stat_density2d(geom = "raster", 
        aes(fill = ..density..), 
        contour = FALSE) + 
    geom_point(data = df_clustered, aes(x = x, y = y, col = as.factor(cluster))) + 
    geom_text(data = trimmed_cluster_centers, 
       aes(x = x, y = y, label = as.character(1:length(trimmed_cluster_centers$x))), 
       size = 5, 
       fontface = "bold", 
       col = "yellow2") + 
    scale_x_continuous(expand = c(0,0), 
         limits = c(0,1)) + 
    scale_y_continuous(expand = c(0,0), 
         limits = c(0,1)) + 
    theme_tufte(base_size = 11, base_family = "Helvetica") + 
    theme(axis.text = element_text(color = "black"), 
      panel.border = element_rect(colour = "black", fill=NA, size=0.7), 
      legend.key.height = unit(0.8,"line"), 
      legend.key.width = unit(0.5, "line")) + 
    scale_fill_gradientn(name = "Density", 
         colours = matlab.like(1000)) + 
     scale_color_brewer(name = "cluster ID", 
        type = "qual", 
        palette = "Spectral")

Source

2017-08-23 komodovaran_

What about DBSCAN, the classic algorithm for **density-based** clustering?

This is *perfect*, I almost feel stupid now!

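Instead of k-means, I suggest you use DBSCAN density-based clustering.

It is a well-tested and frequently used clustering algorithm for finding density-connected components of arbitrary shape.

The N in the name stands for noise, because the algorithm can "ignore" points that do not belong to any cluster (because the density around them is too low). It is quite robust to noise, which may help you.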
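As a rough illustration of the noise handling described above, here is a minimal sketch using `dbscan()` from the dbscan package. The data are synthetic stand-ins (three tight blobs plus uniform background points), and the `eps` and `minPts` values are assumptions chosen for this toy data, not recommendations for the data in the question.

```r
library(dbscan)

set.seed(1)

# Hypothetical helper: a tight Gaussian blob around (cx, cy)
blob <- function(n, cx, cy, sd = 0.02) {
  cbind(x = rnorm(n, cx, sd), y = rnorm(n, cy, sd))
}

# Synthetic data: three dense groups plus 60 uniformly scattered outliers
pts <- rbind(
  blob(100, 0.2, 0.2),
  blob(100, 0.5, 0.7),
  blob(100, 0.8, 0.3),
  cbind(x = runif(60), y = runif(60))
)

# eps = neighbourhood radius, minPts = points required to form a dense region
# (both values are guesses tuned to this toy data)
db <- dbscan(pts, eps = 0.04, minPts = 10)

# Cluster 0 is the noise label; all other labels are real clusters
table(db$cluster)
clustered <- pts[db$cluster != 0, , drop = FALSE]
```

On the data from the question, the same call would take `df[, c("x", "y")]` as input, with `eps` and `minPts` retuned. The dbscan package also provides `kNNdistplot()`, whose "knee" is a common starting point for choosing `eps`.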

Source

2017-08-31 16:09:58

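If you are looking for density peaks, the mean shift algorithm may help. As with any clustering algorithm, you may need to spend some time tuning the parameters, but I got something that looks quite reasonable.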

library(LPCM) 
MS7 = ms(df, 0.07) 
MS7$cluster.center 
     [,1]  [,2] 
1 0.55790817 0.46878846 
2 0.42916901 0.60982702 
3 0.04142821 0.63190748 
4 0.58098385 0.03693459 
5 0.01561478 0.19987934 
6 0.18271326 0.01630580 
7 0.80381893 0.65499869 
8 0.59797721 0.88041362 
9 0.86784436 0.95078057
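To make the idea behind mean shift more concrete, here is a minimal base-R sketch: each point is repeatedly moved to the kernel-weighted mean of the data until it settles on a density mode, and points that settle on the same mode form a cluster. This is an illustration only, not LPCM's implementation; the function name `mean_shift_point`, the Gaussian kernel, the bandwidth `h = 0.05`, and the synthetic data are all assumptions.

```r
# Move one starting position uphill to a density mode (Gaussian kernel, bandwidth h)
mean_shift_point <- function(start, pts, h, tol = 1e-6, max_iter = 500) {
  m <- start
  for (i in seq_len(max_iter)) {
    # Squared distances from the current position to every data point
    d2 <- rowSums(sweep(pts, 2, m)^2)
    w <- exp(-d2 / (2 * h^2))
    # Kernel-weighted mean of the data is the next position
    m_new <- colSums(pts * w) / sum(w)
    if (sqrt(sum((m_new - m)^2)) < tol) break
    m <- m_new
  }
  m_new
}

set.seed(1)
# Synthetic data: two well-separated blobs
X <- rbind(
  cbind(rnorm(100, 0.25, 0.03), rnorm(100, 0.25, 0.03)),
  cbind(rnorm(100, 0.75, 0.03), rnorm(100, 0.75, 0.03))
)

# Run the procedure from every data point; one row of end positions per point
modes <- t(apply(X, 1, mean_shift_point, pts = X, h = 0.05))

# End positions that (nearly) coincide belong to the same cluster
labels <- cutree(hclust(dist(modes), method = "single"), h = 0.01)

# One estimated mode per cluster (rows = clusters, columns = coordinates)
centers <- apply(modes, 2, function(col) tapply(col, labels, mean))
```

The bandwidth plays the same role as the `0.07` passed to `ms()` above: a larger value smooths the density more and merges nearby peaks, while a smaller value splits them.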

Source

2017-08-23 13:48:37 G5W
