2017-08-01

Scikit-learn's implementation of mutual information does not work for partitions of different sizes

I want to compare partitions/clusterings (P1 and P2) of a set S that have different numbers of clusters. For example:

S = [1, 2, 3, 4, 5, 6] 
P1 = [[1, 2], [3,4], [5,6]] 
P2 = [ [1,2,3,4], [5, 6]] 

From what I have read, mutual information could be an approach, and it is implemented in scikit-learn. Nothing in the definition says that the partitions must be of the same size (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html).

However, when I try it in my code, I get an error due to the different sizes.

from sklearn import metrics 
P1 = [[1, 2], [3,4], [5,6]] 
P2 = [ [1,2,3,4], [5, 6]] 
metrics.mutual_info_score(P1,P2) 


ValueErrorTraceback (most recent call last) 
<ipython-input-183-d5cb8d32ce7d> in <module>() 
     2 P2 = [ [1,2,3,4], [5, 6]] 
     3 
----> 4 metrics.mutual_info_score(P1,P2) 

/home/user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/cluster/supervised.pyc in mutual_info_score(labels_true, labels_pred, contingency) 
    556  """ 
    557  if contingency is None: 
--> 558   labels_true, labels_pred = check_clusterings(labels_true, labels_pred) 
    559   contingency = contingency_matrix(labels_true, labels_pred) 
    560  contingency = np.array(contingency, dtype='float') 

/home/user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/cluster/supervised.pyc in check_clusterings(labels_true, labels_pred) 
    34  if labels_true.ndim != 1: 
    35   raise ValueError(
---> 36    "labels_true must be 1D: shape is %r" % (labels_true.shape,)) 
    37  if labels_pred.ndim != 1: 
    38   raise ValueError(

ValueError: labels_true must be 1D: shape is (3, 2) 

Is there a way to use scikit-learn and mutual information to see how close these partitions are? If not, is there a method that does not use mutual information?

Answer


The error comes from the form in which the information is passed to the function. The correct form is a list of labels, one for each element of the global set. Each element's label should correspond to the cluster it belongs to, so elements with the same label are in the same cluster. To solve the example:

S = [1, 2, 3, 4, 5, 6] 
P1 = [[1, 2], [3,4], [5,6]] 
P2 = [ [1,2,3,4], [5, 6]] 
labs_1 = [ 1, 1, 2, 2, 3, 3] 
labs_2 = [1, 1, 1, 1, 2, 2] 
metrics.mutual_info_score(labs_1, labs_2) 

The answer is then:

0.636514168294813 
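The label lists above can also be built programmatically from the partitions instead of by hand; a minimal sketch (the helper `partition_to_labels` is my own name, not part of scikit-learn):

```python
from sklearn import metrics

def partition_to_labels(partition, universe):
    # Map each element of the universe to the index of the cluster that contains it.
    lookup = {elem: idx for idx, cluster in enumerate(partition) for elem in cluster}
    return [lookup[elem] for elem in universe]

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]

labs_1 = partition_to_labels(P1, S)  # [0, 0, 1, 1, 2, 2]
labs_2 = partition_to_labels(P2, S)  # [0, 0, 0, 0, 1, 1]
mi = metrics.mutual_info_score(labs_1, labs_2)
print(mi)
```

Note that mutual_info_score only cares about which elements share a label, not about the label values themselves, so labels starting at 0 give the same score as the hand-written labels starting at 1.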

If we want to compute the mutual information with the partitions in the format originally given, the following code can be used:

from __future__ import division 
import numpy as np 

S = [1, 2, 3, 4, 5, 6] 
P1 = [[1, 2], [3, 4], [5, 6]] 
P2 = [[1, 2, 3, 4], [5, 6]] 
set_partition1 = [set(p) for p in P1] 
set_partition2 = [set(p) for p in P2] 

def prob_dist(clustering, cluster, N): 
    '''Probability that a random element of S falls in the given cluster.''' 
    return len(clustering[cluster])/N 

def prob_joint_dist(clustering1, clustering2, cluster1, cluster2, N): 
    ''' 
    N (int): total number of elements. 
    clustering1 (list): first partition 
    clustering2 (list): second partition 
    cluster1 (int): index of a cluster of the first partition 
    cluster2 (int): index of a cluster of the second partition 
    ''' 
    c1 = clustering1[cluster1] 
    c2 = clustering2[cluster2] 
    n_ij = len(set(c1).intersection(c2)) 
    return n_ij/N 

def mutual_info(clustering1, clustering2, N): 
    ''' 
    clustering1 (list): first partition 
    clustering2 (list): second partition 
    Note: under Python 2, division must be imported from __future__, 
    and that import must be the first statement in the file. 
    ''' 
    n_clas = len(clustering1) 
    n_com = len(clustering2) 
    mutual_info = 0 
    for i in range(n_clas): 
        for j in range(n_com): 
            p_i = prob_dist(clustering1, i, N) 
            p_j = prob_dist(clustering2, j, N) 
            R_ij = prob_joint_dist(clustering1, clustering2, i, j, N) 
            if R_ij > 0: 
                mutual_info += R_ij*np.log(R_ij/(p_i * p_j)) 
    return mutual_info 

mutual_info(set_partition1, set_partition2, len(S)) 

which gives the same answer:

0.63651416829481278 

Note that we use the natural logarithm rather than log2; the code can easily be adapted.
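For instance, a base-2 variant of the same computation, returning the mutual information in bits instead of nats, could look like this (a sketch; `mutual_info_bits` is my own name):

```python
import numpy as np

def mutual_info_bits(clustering1, clustering2, N):
    # Same sum as mutual_info above, but with np.log2 so the
    # result is expressed in bits rather than nats.
    mi = 0.0
    for c1 in clustering1:
        for c2 in clustering2:
            p_i = len(c1) / N
            p_j = len(c2) / N
            r_ij = len(set(c1) & set(c2)) / N
            if r_ij > 0:
                mi += r_ij * np.log2(r_ij / (p_i * p_j))
    return mi

P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]
print(mutual_info_bits(P1, P2, 6))
```

The two results differ only by the constant factor ln(2), since log2(x) = ln(x)/ln(2).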