2014-01-08 26 views
3

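I have a bunch of texts that have been classified into categories, and each document is tagged 0, 1 or 2 with a probability for each tag. How do I cluster a list of lists of (tag, probability) tuples? - python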

[ "this is a foo bar", 
    "bar bar black sheep", 
    "sheep is an animal", 
    "foo foo bar bar", 
    "bar bar sheep sheep" ] 

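The tool that precedes mine in the pipeline returns a list of lists of tuples like the one below, where each element of the outer list corresponds to one document.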

[ [(0,0.3), (1,0.5), (2,0.1)], 
    [(0,0.5), (1,0.3), (2,0.3)], 
    [(0,0.4), (1,0.4), (2,0.5)], 
    [(0,0.3), (1,0.7), (2,0.2)], 
    [(0,0.2), (1,0.6), (2,0.1)] ] 

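I need to see which tag is most probable for each list of tuples. All I can work with is the fact that each document is tagged 0, 1 or 2 with the probabilities given, and I want to end up with this: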

[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] , 
    [[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] , 
    [[(0,0.4), (1,0.4), (2,0.5)]] ] 

As another example:

[in]

[ [(0,0.7), (1,0.2), (2,0.4)], 
    [(0,0.5), (1,0.9), (2,0.3)], 
    [(0,0.3), (1,0.8), (2,0.4)], 
    [(0,0.8), (1,0.2), (2,0.2)], 
    [(0,0.1), (1,0.7), (2,0.5)] ] 

[out]

[[[(0,0.7), (1,0.2), (2,0.4)], 
[(0,0.8), (1,0.2), (2,0.2)]] , 

[[(0,0.5), (1,0.9), (2,0.3)], 
[(0,0.1), (1,0.7), (2,0.5)], 
[(0,0.3), (1,0.8), (2,0.4)]] , 

[]] 

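Note: I do not have access to the original text by the time the data reaches my part of the pipeline.

How do I cluster a list of lists of (tag, probability) tuples? Is there a function for this in numpy, scipy, sklearn or any other Python-able ML suite? Or even NLTK?

Assume that the number of clusters is fixed, but the cluster sizes are not.

So far I have only tried to find the maximum value as the centroid of each cluster, but that only gives me the top element of each cluster: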

instream = [ [(0,0.3), (1,0.5), (2,0.1)], 
         [(0,0.5), (1,0.3), (2,0.3)], 
         [(0,0.4), (1,0.4), (2,0.5)], 
         [(0,0.3), (1,0.7), (2,0.2)], 
         [(0,0.2), (1,0.6), (2,0.1)] ] 

# Find centroid. 
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0] 
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0] 
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0] 

c1_centroid = [i for i,j in enumerate(instream) if j[0] == c1_centroid_value][0] 
c2_centroid = [i for i,j in enumerate(instream) if j[1] == c2_centroid_value][0] 
c3_centroid = [i for i,j in enumerate(instream) if j[2] == c3_centroid_value][0] 

print instream[c1_centroid] 
print instream[c2_centroid] 
print instream[c2_centroid] 

[out] (top element in each cluster):

[(0, 0.5), (1, 0.3), (2, 0.3)] 
[(0, 0.3), (1, 0.7), (2, 0.2)] 
[(0, 0.3), (1, 0.7), (2, 0.2)] 
+0

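It would help if you could show some example input/output. Explain a bit more about what exactly you are doing, and also make sure it is not an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem).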

+0

@InbarRose, I have edited the question to give more context. – alvas

+2

Shouldn't the third line of your out be [(0,0.4),(1,0.4),(2,0.5)]?

Answers

2

If I understand correctly, this is what you want.

import numpy as np 

N_TYPES = 3 

instream = [ [(0,0.3), (1,0.5), (2,0.1)], 
      [(0,0.5), (1,0.3), (2,0.3)], 
      [(0,0.4), (1,0.4), (2,0.5)], 
      [(0,0.3), (1,0.7), (2,0.2)], 
      [(0,0.2), (1,0.6), (2,0.1)] ] 
instream = np.array(instream) 

# instream now has shape (n_docs, N_TYPES, 2); slicing off the last axis
# drops the tags and keeps only the probabilities (works on Python 2 and 3)
values = instream[:, :, 1] 

# determine the cluster of each document by using maximum probability 
belongs_to = np.argmax(values, axis=1) 

# construct clusters of indices to your instream 
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(N_TYPES)] 

# apply the indices to obtain full output 
out = [instream[cluster_indices[k]].tolist() for k in range(N_TYPES)] 

Output of out:

[[[[0.0, 0.5], [1.0, 0.3], [2.0, 0.3]]], 

[[[0.0, 0.3], [1.0, 0.5], [2.0, 0.1]], 
    [[0.0, 0.3], [1.0, 0.7], [2.0, 0.2]], 
    [[0.0, 0.2], [1.0, 0.6], [2.0, 0.1]]], 

[[[0.0, 0.4], [1.0, 0.4], [2.0, 0.5]]]] 

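I used numpy arrays because they make searching and indexing easy. For example, the expression (belongs_to == 1).nonzero()[0] returns an array of the indices into belongs_to at which the value is 1. An example of indexing with such an array is instream[cluster_indices[2]].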
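Note that converting to a numpy array turns the integer tags into floats, as the output above shows. If the original tuples should stay intact, the same max-probability grouping can be sketched in plain Python. This is only a rough sketch: the function name group_by_top_tag is made up for illustration, and it assumes the tags are the integers 0 to n_types - 1. The input is the second example from the question.

```python
def group_by_top_tag(docs, n_types):
    """Group documents by the tag that has the highest probability."""
    # one (possibly empty) cluster per tag, so the number of clusters is fixed
    clusters = [[] for _ in range(n_types)]
    for doc in docs:
        # doc is a list of (tag, probability) tuples; on ties max() keeps the first
        top_tag, _ = max(doc, key=lambda pair: pair[1])
        clusters[top_tag].append(doc)
    return clusters


instream = [[(0, 0.7), (1, 0.2), (2, 0.4)],
            [(0, 0.5), (1, 0.9), (2, 0.3)],
            [(0, 0.3), (1, 0.8), (2, 0.4)],
            [(0, 0.8), (1, 0.2), (2, 0.2)],
            [(0, 0.1), (1, 0.7), (2, 0.5)]]

out = group_by_top_tag(instream, 3)
print(out)
```

Documents keep their input order inside each cluster, and a tag that never wins simply yields an empty list, as in the question's second [out].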

0

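Why keep the indices in the tuples? The 0, 1, 2 are redundant and, if I understand correctly, carry no information. Just feed the n_samples x 3 list of probabilities to any scikit-learn algorithm. Or, if you only want the most probable label assignment, do np.argmax(X, axis=1).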
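As a rough illustration of that suggestion, here is a sketch that strips the tags from the question's first example and takes the row-wise argmax. The name X follows the expression above; labels is a name chosen here, and the sketch assumes the tuples in each document are already ordered by tag.

```python
import numpy as np

instream = [[(0, 0.3), (1, 0.5), (2, 0.1)],
            [(0, 0.5), (1, 0.3), (2, 0.3)],
            [(0, 0.4), (1, 0.4), (2, 0.5)],
            [(0, 0.3), (1, 0.7), (2, 0.2)],
            [(0, 0.2), (1, 0.6), (2, 0.1)]]

# drop the redundant tags: X is an n_samples x 3 matrix of probabilities
X = np.array([[prob for _, prob in doc] for doc in instream])

# most probable label for each document (first maximum wins on ties)
labels = np.argmax(X, axis=1)
print(labels.tolist())
```

The same X could be passed to a clustering estimator instead if a hard argmax assignment is not what is wanted.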