2017-08-23 28 views
2

Optimizing trimmed k-means for clustering 2D data with many outliers? Better methods?

I have the following type of data/plot:

[Image: plot of the data]

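Looking only at the individual data points, it is almost impossible to tell where the peaks should be. But if I draw a 2D density smooth in ggplot, I get these very nice peaks, where I can visually count the ~10 groups of points that I would like to find. The exact number of "valid groups" is of course up for discussion.

The data is here: https://pastebin.com/5wquw7UF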

library(ggplot2)
library(ggthemes)   # provides theme_tufte()
library(colorRamps) # provides matlab.like()
library(tclust)

ggplot(data = df, aes(x = x, y = y)) + 
    stat_density2d(geom = "raster", 
        aes(fill = ..density..), 
        contour = FALSE) + 
    geom_point(col = "white", alpha = 0.1) + 
    scale_x_continuous(expand = c(0,0), 
         limits = c(0,1)) + 
    scale_y_continuous(expand = c(0,0), 
         limits = c(0,1)) + 
    theme_tufte(base_size = 11, base_family = "Helvetica") + 
    theme(axis.text = element_text(color = "black"), 
      panel.border = element_rect(colour = "black", fill=NA, size=0.7), 
      legend.key.height = unit(2.5,"line"), 
      legend.key.width = unit(1, "line")) + 
    scale_fill_gradientn(name = "Density", 
         colours = matlab.like(1000)) 

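I looked into trimmed clustering with the tclust package. By fiddling with the parameters I was able to come up with the result below. However, no matter how I adjust them, I cannot get the groups to look as "tight" as the ones I see by eye. Group 5 in particular seems to creep into places where it does not belong. Group 10 is also a bit odd, but it is isolated enough to discard afterwards.

Is there a better method, or am I just not understanding how to set the parameters properly?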

set.seed(2) 

trimmed_cluster <- tclust(
    x = df, 
    k = 10, # 9 
    alpha = 0.1, # 0.1 
    drop.empty.clust = FALSE, 
    equal.weights = TRUE, 
    restr = c("sigma", "eigen"), # sigma 
    restr.fact = 1 
) 

df$cluster <- trimmed_cluster$cluster 

trimmed_cluster_centers <- data.frame(t(trimmed_cluster$centers)) 

df_clustered <- subset(df, cluster != 0) 

ggplot(data = df, aes(x = x, y = y)) + 
    stat_density2d(geom = "raster", 
        aes(fill = ..density..), 
        contour = FALSE) + 
    geom_point(data = df_clustered, aes(x = x, y = y, col = as.factor(cluster))) + 
    geom_text(data = trimmed_cluster_centers, 
       aes(x = x, y = y, label = as.character(1:length(trimmed_cluster_centers$x))), 
       size = 5, 
       fontface = "bold", 
       col = "yellow2") + 
    scale_x_continuous(expand = c(0,0), 
         limits = c(0,1)) + 
    scale_y_continuous(expand = c(0,0), 
         limits = c(0,1)) + 
    theme_tufte(base_size = 11, base_family = "Helvetica") + 
    theme(axis.text = element_text(color = "black"), 
      panel.border = element_rect(colour = "black", fill=NA, size=0.7), 
      legend.key.height = unit(0.8,"line"), 
      legend.key.width = unit(0.5, "line")) + 
    scale_fill_gradientn(name = "Density", 
         colours = matlab.like(1000)) + 
     scale_color_brewer(name = "cluster ID", 
        type = "qual", 
        palette = "Spectral") 

[Image: plot of the resulting clusters]

+1

What about the classic DBSCAN algorithm for **density-based** clustering?

+0

This is *perfect*, I almost feel stupid now!

Answers

1

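Instead of k-means, I suggest you use DBSCAN, a density-based clustering algorithm.

It is a well-tested and frequently used clustering algorithm for finding density-connected components of arbitrary shape.

The N in the name stands for noise, because the algorithm can "ignore" points that do not belong to any cluster (because their local density is too low). It is fairly robust to noise, which may help in your case.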
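As a minimal sketch of what this could look like in R, the example below uses the `dbscan` package on synthetic data. The data, `eps = 0.03`, and `minPts = 10` are illustrative assumptions, not values tuned for the data in the question; in practice both parameters would need adjusting.

```r
library(dbscan)

set.seed(1)

# Synthetic stand-in for the question's data (an assumption for illustration):
# two tight blobs plus uniformly scattered outliers on the unit square.
blob1 <- cbind(x = rnorm(100, 0.25, 0.02), y = rnorm(100, 0.25, 0.02))
blob2 <- cbind(x = rnorm(100, 0.75, 0.02), y = rnorm(100, 0.75, 0.02))
noise <- cbind(x = runif(40), y = runif(40))
pts <- rbind(blob1, blob2, noise)

# eps is the neighbourhood radius; minPts is the number of points required
# within that radius for a point to count as a core point.
db <- dbscan(pts, eps = 0.03, minPts = 10)

# Cluster label 0 marks points that DBSCAN treats as noise.
table(db$cluster)

# Keep only the points that were assigned to a cluster.
clustered <- pts[db$cluster != 0, , drop = FALSE]
```

The same package also offers `kNNdistplot()`, which is one common heuristic for choosing `eps`: plot the sorted k-nearest-neighbour distances and look for a knee in the curve.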

0

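If you are looking for density peaks, the mean shift algorithm may help. As with any clustering algorithm, you will probably need to spend some time tuning the parameters, but I got something that looks fairly reasonable.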

library(LPCM) 
MS7 = ms(df, 0.07) 
MS7$cluster.center 
     [,1]  [,2] 
1 0.55790817 0.46878846 
2 0.42916901 0.60982702 
3 0.04142821 0.63190748 
4 0.58098385 0.03693459 
5 0.01561478 0.19987934 
6 0.18271326 0.01630580 
7 0.80381893 0.65499869 
8 0.59797721 0.88041362 
9 0.86784436 0.95078057 

Results of Mean shift
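To illustrate why mean shift ends up at density peaks, here is a rough base-R sketch of a single mean-shift trajectory with a Gaussian kernel. It is not how `LPCM::ms` is implemented; the function name `mean_shift_point`, the synthetic data, and the bandwidth `h = 0.07` are all assumptions made for the example.

```r
# Move a starting point uphill on the kernel density estimate by repeatedly
# replacing it with the Gaussian-weighted mean of the data around it.
mean_shift_point <- function(start, data, h, tol = 1e-6, max_iter = 500) {
  p <- start
  for (i in seq_len(max_iter)) {
    # Squared distance from the current position to every data point
    d2 <- rowSums((data - matrix(p, nrow(data), ncol(data), byrow = TRUE))^2)
    w <- exp(-d2 / (2 * h^2))             # Gaussian kernel weights
    p_new <- colSums(data * w) / sum(w)   # weighted mean = next position
    if (sqrt(sum((p_new - p)^2)) < tol) break
    p <- p_new
  }
  p_new
}

set.seed(1)

# Synthetic data (illustrative only): two blobs on the unit square
pts <- rbind(
  cbind(x = rnorm(100, 0.25, 0.03), y = rnorm(100, 0.25, 0.03)),
  cbind(x = rnorm(100, 0.75, 0.03), y = rnorm(100, 0.75, 0.03))
)

# Start one trajectory from a point in each blob
mode1 <- mean_shift_point(pts[1, ], pts, h = 0.07)
mode2 <- mean_shift_point(pts[101, ], pts, h = 0.07)
rbind(mode1, mode2)
```

Running a trajectory from every data point and grouping points whose trajectories converge to the same location gives the clustering; the bandwidth controls how many distinct peaks survive.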
