Python cluster 'purity' metric

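I am using the Gaussian Mixture Model (GMM) from sklearn.mixture to cluster my data set.

I can use the function score() to compute the log probability of the data under the model.

However, I am looking for a metric called 'purity', which is defined in this paper.

How can I implement it in Python? My current implementation looks like this: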

import numpy as np
# GaussianMixture replaces the GMM class, which was removed in scikit-learn 0.20
from sklearn.mixture import GaussianMixture

# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)

clusterer = GaussianMixture(n_components=3, covariance_type='diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)

# Now I can count the labels for each cluster.
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)

But I cannot loop through each cluster in order to compute the confusion matrix (as described in this question).

asked 2015-12-02 by Kuka

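That paper is rather opaque. [This answer](http://stats.stackexchange.com/a/154379/89612) on Cross Validated simplifies the procedure a bit. – kdbanman

Please post the code you have so far, and tell us about the data structures involved. – kdbanman

Currently, my code is: `from sklearn.mixture import GMM`, `clusterer = GMM(5, 'diag')`, `clusterer.fit(X)`, `cluster_labels = clusterer.predict(X)`. I see that in order to compute purity I need the confusion matrix. My problem now is that I cannot loop through each cluster and count how many objects fall into each class. – Kuka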

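sklearn does not implement a cluster purity metric. You have 2 options:

1. Implement the measure yourself using sklearn data structures. This and this have some Python source code for measuring purity, but either your data or the function bodies will need to be adapted so that they are compatible with each other.
2. Use the (less mature) PML library, which implements cluster purity.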
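
The first option can be sketched with scikit-learn's `contingency_matrix`, which cross-tabulates true classes against predicted clusters. This is only a minimal sketch: it assumes ground-truth labels are available (purity cannot be computed from the cluster assignments alone), and the function name `purity_from_contingency` and the toy label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix


def purity_from_contingency(y_true, y_pred):
    # Hypothetical helper: rows are true classes, columns are predicted clusters
    table = contingency_matrix(y_true, y_pred)
    # For each cluster (column) keep the count of its most common class,
    # then divide by the total number of samples
    return table.max(axis=0).sum() / table.sum()


# Made-up labels, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 0])

print(purity_from_contingency(y_true, y_pred))
```

With a fitted mixture model, `y_pred` would be the output of `clusterer.predict(X)` and `y_true` the known class of each sample, for example the MNIST digit.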

answered 2015-12-02 16:29:02 by kdbanman

A very late contribution.

You can try implementing it like this, much like in this gist:

import numpy as np
from sklearn.metrics import accuracy_score

def purity_score(y_true, y_pred):
    # Accept lists as well as arrays (numeric labels are assumed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Array which will hold the majority-voted label of every sample
    y_labeled_voted = np.zeros_like(y_true)
    labels = np.unique(y_true)
    # Use n_classes + 1 bin edges so that each class gets its own bin
    # [label_i, label_i+1[ ; the extra edge max(labels) + 1 closes the last bin
    bins = np.concatenate((labels, [np.max(labels) + 1]), axis=0)

    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred == cluster], bins=bins)
        # Find the most frequent true label in the cluster; argmax returns
        # a bin index, so map it back to the actual label value
        winner = labels[np.argmax(hist)]
        y_labeled_voted[y_pred == cluster] = winner

    return accuracy_score(y_true, y_labeled_voted)
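
To see what the majority vote does cluster by cluster, the per-cluster class counts can also be listed directly, which is the loop the question was asking about. The following is a rough sketch using NumPy only; the helper name `per_cluster_counts` and the label arrays are invented for illustration, and integer labels are assumed.

```python
import numpy as np


def per_cluster_counts(y_true, y_pred):
    """Return {cluster: {true_label: count}} for every predicted cluster."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    counts = {}
    for cluster in np.unique(y_pred):
        # Boolean mask selects the true labels of samples in this cluster
        members = y_true[y_pred == cluster]
        labels, freq = np.unique(members, return_counts=True)
        # int() conversion assumes integer labels (an assumption of this sketch)
        counts[int(cluster)] = {int(l): int(f) for l, f in zip(labels, freq)}
    return counts


# Made-up labels, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 0])

counts = per_cluster_counts(y_true, y_pred)
for cluster, table in counts.items():
    print(cluster, table)

# Purity: sum of each cluster's largest class count over the number of samples
purity = sum(max(table.values()) for table in counts.values()) / len(y_true)
print(purity)
```

Each inner dictionary is one column of the confusion (contingency) matrix, so the largest value in it is the size of the cluster's majority class.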

answered 2017-07-06 00:21:37 by David
