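Clustering histograms using Earth Mover's Distance in R
I have data represented as many different histograms of a single variable. I want to use unsupervised clustering to determine which histograms are similar. I would also like to know the best number of clusters to use.
I have read about the Earth Mover's Distance (EMD) as a measure of the distance between histograms, but I don't know how to use it in common clustering algorithms (e.g., k-means).
Primary: What R package and functions should I use to cluster histograms?
Secondary: How can I determine the "optimal" number of clusters?
Example dataset 1 (3 unimodal clusters):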
v1 <- rnorm(n=100, mean = 10, sd = 1) # cluster 1 (around 10)
v2 <- rnorm(n=100, mean = 50, sd = 5) # cluster 2 (around 50)
v3 <- rnorm(n=100, mean = 100, sd = 10) # cluster 3 (around 100)
v4 <- rnorm(n=100, mean = 12, sd = 2) # cluster 1
v5 <- rnorm(n=100, mean = 45, sd = 6) # cluster 2
v6 <- rnorm(n=100, mean = 95, sd = 6) # cluster 3
Example dataset 2 (3 bimodal clusters):
b1 <- c(rnorm(n=100, mean=9, sd=2) , rnorm(n=100, mean=200, sd=20)) # cluster 1 (around 10 and 200)
b2 <- c(rnorm(n=100, mean=50, sd=5), rnorm(n=100, mean=100, sd=10)) # cluster 2 (around 50 and 100)
b3 <- c(rnorm(n=100, mean=99, sd=8), rnorm(n=100, mean=175, sd=17)) # cluster 3 (around 100 and 175)
b4 <- c(rnorm(n=100, mean=12, sd=2), rnorm(n=100, mean=180, sd=40)) # cluster 1
b5 <- c(rnorm(n=100, mean=45, sd=6), rnorm(n=100, mean=80, sd=30)) # cluster 2
b6 <- c(rnorm(n=100, mean=95, sd=6), rnorm(n=100, mean=170, sd=25)) # cluster 3
b7 <- c(rnorm(n=100, mean=10, sd=1), rnorm(n=100, mean=210, sd=30)) # cluster 1 (around 10 and 200)
b8 <- c(rnorm(n=100, mean=55, sd=5), rnorm(n=100, mean=90, sd=15)) # cluster 2 (around 50 and 100)
b9 <- c(rnorm(n=100, mean=89, sd=9), rnorm(n=100, mean=165, sd=20)) # cluster 3 (around 100 and 175)
b10 <- c(rnorm(n=100, mean=8, sd=2), rnorm(n=100, mean=160, sd=30)) # cluster 1
b11 <- c(rnorm(n=100, mean=55, sd=6), rnorm(n=100, mean=110, sd=10)) # cluster 2
b12 <- c(rnorm(n=100, mean=105, sd=6), rnorm(n=100, mean=185, sd=21)) # cluster 3
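EMD is very expensive, so you will need to use lower bounds and indexing to speed up your clustering. K-means only works with Bregman divergences, and I don't think EMD is one of them.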
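Since k-means is ruled out, one option is a method that accepts a precomputed distance matrix, such as hierarchical clustering. The following is only a minimal sketch under stated assumptions, not a definitive answer. For a single variable, the EMD between two equal-mass distributions equals the area between their CDFs, so here it is approximated by comparing quantiles on a shared grid, which is cheap in one dimension. The helper name `emd_1d`, the quantile grid, the fixed seed, average linkage, and the use of average silhouette width to pick the number of clusters are all assumptions made for illustration.

```r
# Sketch: approximate 1-D EMD between samples, then cluster on the distance matrix.
library(cluster)  # provides silhouette(); a recommended package in standard R installs

set.seed(42)  # assumption: fixed seed so the run is repeatable

# Example dataset 1 from the question, collected into a named list
samples <- list(
  v1 = rnorm(n = 100, mean = 10,  sd = 1),
  v2 = rnorm(n = 100, mean = 50,  sd = 5),
  v3 = rnorm(n = 100, mean = 100, sd = 10),
  v4 = rnorm(n = 100, mean = 12,  sd = 2),
  v5 = rnorm(n = 100, mean = 45,  sd = 6),
  v6 = rnorm(n = 100, mean = 95,  sd = 6)
)

# Hypothetical helper: approximate the 1-D EMD as the mean absolute
# difference between the two samples' quantiles on a common probability grid.
emd_1d <- function(x, y, probs = seq(0.005, 0.995, by = 0.01)) {
  qx <- quantile(x, probs, names = FALSE)
  qy <- quantile(y, probs, names = FALSE)
  mean(abs(qx - qy))
}

# Fill a symmetric pairwise distance matrix
n <- length(samples)
d <- matrix(0, n, n, dimnames = list(names(samples), names(samples)))
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    d[i, j] <- d[j, i] <- emd_1d(samples[[i]], samples[[j]])
  }
}
emd_dist <- as.dist(d)

# Hierarchical clustering works directly on the distance object
hc <- hclust(emd_dist, method = "average")

# Average silhouette width for each candidate number of clusters
ks <- 2:(n - 1)
avg_sil <- sapply(ks, function(k) {
  mean(silhouette(cutree(hc, k = k), emd_dist)[, "sil_width"])
})

best_k   <- ks[which.max(avg_sil)]
clusters <- cutree(hc, k = best_k)
print(best_k)
print(clusters)
```

On well-separated data like dataset 1, this should group v1 with v4, v2 with v5, and v3 with v6. The same code applies to dataset 2 by swapping in b1 to b12. Silhouette width is only one of several criteria for choosing the number of clusters, and the quantile-based distance is an approximation that holds for one-dimensional data only.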