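How do I cluster a list of lists of (tag, probability) tuples? - python

I have a bunch of texts that have been classified into different categories, and each document is tagged as 0, 1 or 2, with a probability for each tag.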

[ "this is a foo bar", 
    "bar bar black sheep", 
    "sheep is an animal", 
    "foo foo bar bar", 
    "bar bar sheep sheep" ]

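The previous tool in the pipeline returns a list of lists of tuples like the one below, where each element of the outer list represents one document. All I can work with is the fact that each document is tagged 0, 1 or 2, with a probability for each tag: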

[ [(0,0.3), (1,0.5), (2,0.1)], 
    [(0,0.5), (1,0.3), (2,0.3)], 
    [(0,0.4), (1,0.4), (2,0.5)], 
    [(0,0.3), (1,0.7), (2,0.2)], 
    [(0,0.2), (1,0.6), (2,0.1)] ]

I need to see which tag is the most probable in each list of tuples and group the documents accordingly, to get:

[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] , 
    [[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] , 
    [[(0,0.4), (1,0.4), (2,0.5)]] ]

As another example:

[in]:

[ [(0,0.7), (1,0.2), (2,0.4)], 
    [(0,0.5), (1,0.9), (2,0.3)], 
    [(0,0.3), (1,0.8), (2,0.4)], 
    [(0,0.8), (1,0.2), (2,0.2)], 
    [(0,0.1), (1,0.7), (2,0.5)] ]

[out]:

[[[(0,0.7), (1,0.2), (2,0.4)], 
[(0,0.8), (1,0.2), (2,0.2)]] , 

[[(0,0.5), (1,0.9), (2,0.3)], 
[(0,0.1), (1,0.7), (2,0.5)], 
[(0,0.3), (1,0.8), (2,0.4)]] , 

[]]

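Note: I do not necessarily have access to the original text by the time the data reaches my part of the pipeline.

How do I cluster the lists of (tag, probability) tuples? Is there anything in numpy, scipy, sklearn or any other python-able ML suite that does this? Or even NLTK.

Let's assume that the number of clusters is fixed but the cluster sizes are not.

I have only tried finding the maximum value as the centroid, but that only gives me the first value in each cluster: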

instream = [ [(0,0.3), (1,0.5), (2,0.1)], 
         [(0,0.5), (1,0.3), (2,0.3)], 
         [(0,0.4), (1,0.4), (2,0.5)], 
         [(0,0.3), (1,0.7), (2,0.2)], 
         [(0,0.2), (1,0.6), (2,0.1)] ] 

# Find centroid. 
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0] 
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0] 
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0] 

c1_centroid = [i for i,j in enumerate(instream) if j[0] == c1_centroid_value][0] 
c2_centroid = [i for i,j in enumerate(instream) if j[1] == c2_centroid_value][0] 
c3_centroid = [i for i,j in enumerate(instream) if j[2] == c3_centroid_value][0] 

print instream[c1_centroid] 
print instream[c2_centroid] 
print instream[c2_centroid]

[out] (the top element in each cluster):

[(0, 0.5), (1, 0.3), (2, 0.3)] 
[(0, 0.3), (1, 0.7), (2, 0.2)] 
[(0, 0.3), (1, 0.7), (2, 0.2)]

Source

2014-01-08 alvas

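It would help if you could show some examples of input and output. Explain a bit more about what exactly you are doing, and also make sure this is not an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). –

@InbarRose, I have edited the question to give more context. – alvas

Shouldn't the third line of your output be [(0,0.4), (1,0.4), (2,0.5)]? –

If I understand correctly, this is what you want.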

import numpy as np 

N_TYPES = 3 

instream = [ [(0,0.3), (1,0.5), (2,0.1)], 
      [(0,0.5), (1,0.3), (2,0.3)], 
      [(0,0.4), (1,0.4), (2,0.5)], 
      [(0,0.3), (1,0.7), (2,0.2)], 
      [(0,0.2), (1,0.6), (2,0.1)] ] 
instream = np.array(instream) 

# this removes document tags because we only consider probabilities here 
values = [[pair[1] for pair in doc] for doc in instream] 

# determine the cluster of each document by using maximum probability 
belongs_to = np.array([np.argmax(doc_probs) for doc_probs in values]) 

# construct clusters of indices to your instream 
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(N_TYPES)] 

# apply the indices to obtain full output 
out = [instream[cluster_indices[k]].tolist() for k in range(N_TYPES)]

The value of out:

[[[[0.0, 0.5], [1.0, 0.3], [2.0, 0.3]]], 

[[[0.0, 0.3], [1.0, 0.5], [2.0, 0.1]], 
    [[0.0, 0.3], [1.0, 0.7], [2.0, 0.2]], 
    [[0.0, 0.2], [1.0, 0.6], [2.0, 0.1]]], 

[[[0.0, 0.4], [1.0, 0.4], [2.0, 0.5]]]]
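As the output above shows, converting to a NumPy array turns the integer tags into floats and the tuples into lists. If the original tuples need to be preserved, the same grouping can be sketched in plain Python. This is only a minimal sketch: the helper name group_by_top_tag and its n_types parameter are made up for illustration, and ties are resolved in favour of the first tag with the highest probability.

```python
def group_by_top_tag(docs, n_types=3):
    """Group documents by the tag that has the highest probability."""
    clusters = [[] for _ in range(n_types)]
    for doc in docs:
        # max() with a key compares only the probabilities;
        # on a tie it returns the first pair it encountered.
        top_tag, _ = max(doc, key=lambda pair: pair[1])
        clusters[top_tag].append(doc)
    return clusters


instream = [[(0, 0.3), (1, 0.5), (2, 0.1)],
            [(0, 0.5), (1, 0.3), (2, 0.3)],
            [(0, 0.4), (1, 0.4), (2, 0.5)],
            [(0, 0.3), (1, 0.7), (2, 0.2)],
            [(0, 0.2), (1, 0.6), (2, 0.1)]]

out = group_by_top_tag(instream)
for cluster in out:
    print(cluster)
```

Documents stay in input order inside each cluster, and a tag that never wins simply gets an empty list.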

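I used numpy arrays because they are convenient to search and index. For example, the expression (belongs_to == 1).nonzero()[0] returns an array of the indices into belongs_to at which the value is 1. An example of indexing with such an array is instream[cluster_indices[2]].

Source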
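The indexing step can be tried in isolation. The following is a small sketch in which belongs_to is hard-coded to the values that the five example documents produce, and docs is a made-up array of placeholder names standing in for instream.

```python
import numpy as np

# Cluster label of each of the five example documents.
belongs_to = np.array([1, 0, 2, 1, 1])

# Comparing an array with a scalar yields a boolean mask.
mask = belongs_to == 1

# nonzero() returns a tuple with one index array per dimension;
# [0] picks the index array of the only dimension.
indices = mask.nonzero()[0]

# An integer index array selects the matching rows of another array.
docs = np.array(["d0", "d1", "d2", "d3", "d4"])  # placeholder names
selected = docs[indices]

print(indices.tolist())   # [0, 3, 4]
print(selected.tolist())  # ['d0', 'd3', 'd4']
```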

2014-01-08 16:57:11 islijepcevic

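Why keep the indices in the tuples? The 0, 1 and 2 are redundant and, if I understand correctly, carry no information. Just feed the n_samples x 3 list of probabilities to any scikit-learn algorithm. Or, if you only want the most probable label assignment, do np.argmax(X, axis=1).

Source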
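A rough sketch of that suggestion, using the second example from the question as input: the tags are dropped to build the n_samples x 3 matrix X, and np.argmax(X, axis=1) gives one label per document. The names labels and clusters are chosen here for illustration only, and the scikit-learn route is not shown.

```python
import numpy as np

instream = [[(0, 0.7), (1, 0.2), (2, 0.4)],
            [(0, 0.5), (1, 0.9), (2, 0.3)],
            [(0, 0.3), (1, 0.8), (2, 0.4)],
            [(0, 0.8), (1, 0.2), (2, 0.2)],
            [(0, 0.1), (1, 0.7), (2, 0.5)]]

# Keep only the probabilities: X has shape (n_samples, 3).
X = np.array([[prob for _, prob in doc] for doc in instream])

# Column index of the largest probability in each row.
labels = np.argmax(X, axis=1)

# Collect the original documents per label; unused labels stay empty.
clusters = [[instream[i] for i in np.flatnonzero(labels == k)]
            for k in range(X.shape[1])]

print(labels.tolist())  # [0, 1, 1, 0, 1]
```

The clusters contain the same documents as the [out] listed in the question for this input, although the order inside a cluster follows the input order.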

2014-01-09 00:08:10
