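I have a bunch of texts that have been sorted into categories, and each document is tagged 0, 1, or 2 with a probability for each tag. How do I cluster a list of lists of (tag, probability) tuples? - python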
[ "this is a foo bar",
"bar bar black sheep",
"sheep is an animal",
"foo foo bar bar",
"bar bar sheep sheep" ]
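The previous tool in my pipeline returns a list of lists of tuples like the one below, where each element of the outer list represents one document.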
[ [(0,0.3), (1,0.5), (2,0.1)],
[(0,0.5), (1,0.3), (2,0.3)],
[(0,0.4), (1,0.4), (2,0.5)],
[(0,0.3), (1,0.7), (2,0.2)],
[(0,0.2), (1,0.6), (2,0.1)] ]
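All I have to work with is the fact that each document is tagged 0, 1, or 2 with the probabilities shown above. I need to find which tag is most probable in each list of tuples and group the documents accordingly, to get: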
[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] ,
[[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] ,
[[(0,0.4), (1,0.4), (2,0.5)]] ]
As another example:
[in]:
[ [(0,0.7), (1,0.2), (2,0.4)],
[(0,0.5), (1,0.9), (2,0.3)],
[(0,0.3), (1,0.8), (2,0.4)],
[(0,0.8), (1,0.2), (2,0.2)],
[(0,0.1), (1,0.7), (2,0.5)] ]
[out]:
[[[(0,0.7), (1,0.2), (2,0.4)],
[(0,0.8), (1,0.2), (2,0.2)]] ,
[[(0,0.5), (1,0.9), (2,0.3)],
[(0,0.1), (1,0.7), (2,0.5)],
[(0,0.3), (1,0.8), (2,0.4)]] ,
[]]
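One way to read the two examples is as a hard assignment: each document goes into the cluster of its most probable tag. The following is a minimal pure-Python sketch under that assumption, not a definitive answer. The function name `group_by_top_tag` and the `n_tags` parameter are made up for illustration, and ties go to whichever pair is listed first.

```python
def group_by_top_tag(docs, n_tags=3):
    """Group documents by their most probable tag (hypothetical helper).

    docs: list of documents, each a list of (tag, probability) tuples.
    Returns a list of n_tags clusters; a cluster may be empty.
    """
    clusters = [[] for _ in range(n_tags)]
    for doc in docs:
        # Pick the (tag, probability) pair with the highest probability.
        # max() returns the first maximum, so ties go to the earlier pair.
        top_tag, _ = max(doc, key=lambda pair: pair[1])
        clusters[top_tag].append(doc)
    return clusters


# Input taken from the second example above.
docs = [
    [(0, 0.7), (1, 0.2), (2, 0.4)],
    [(0, 0.5), (1, 0.9), (2, 0.3)],
    [(0, 0.3), (1, 0.8), (2, 0.4)],
    [(0, 0.8), (1, 0.2), (2, 0.2)],
    [(0, 0.1), (1, 0.7), (2, 0.5)],
]

clusters = group_by_top_tag(docs)
for tag, members in enumerate(clusters):
    print(tag, members)
```

On this input the cluster membership matches the expected output of the second example, although documents inside a cluster come out in input order.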
Note: I no longer have access to the original text by the time the data reaches my part of the pipeline.
How do I cluster a list of lists of (tag, probability) tuples? Is there a function for this in numpy, scipy, sklearn, or any other Python-accessible ML suite? Or even NLTK?
Assume that the number of clusters is fixed but the cluster sizes are not.
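With a fixed number of clusters and no constraint on their size, one simple reading is a hard assignment by most probable tag, which `numpy.argmax` computes directly. The sketch below is not a dedicated clustering routine. It assumes every document lists tags 0, 1, 2 in that order, and the names `docs`, `probs`, `labels`, and `clusters` are made up for illustration.

```python
import numpy as np

# Input taken from the second example above.
docs = [
    [(0, 0.7), (1, 0.2), (2, 0.4)],
    [(0, 0.5), (1, 0.9), (2, 0.3)],
    [(0, 0.3), (1, 0.8), (2, 0.4)],
    [(0, 0.8), (1, 0.2), (2, 0.2)],
    [(0, 0.1), (1, 0.7), (2, 0.5)],
]

# Assumption: each document lists tags 0, 1, 2 in that order,
# so the probabilities stack into an (n_docs, n_tags) array.
probs = np.array([[p for _, p in doc] for doc in docs])

# Index of the most probable tag for each document.
labels = probs.argmax(axis=1)

# One cluster per tag: the count is fixed, the size is not (may be zero).
n_tags = probs.shape[1]
clusters = [
    [docs[i] for i in np.flatnonzero(labels == tag)]
    for tag in range(n_tags)
]

print(labels.tolist())
for tag, members in enumerate(clusters):
    print(tag, members)
```

If the probabilities should drive a real clustering rather than a per-document argmax, the same `probs` array is the usual shape that array-based clustering tools expect as input.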
So far I have only tried looking for the maximum value as a centroid, but that only gives me the top element in each cluster:
instream = [ [(0,0.3), (1,0.5), (2,0.1)],
[(0,0.5), (1,0.3), (2,0.3)],
[(0,0.4), (1,0.4), (2,0.5)],
[(0,0.3), (1,0.7), (2,0.2)],
[(0,0.2), (1,0.6), (2,0.1)] ]
# Find the "centroid" value per tag: the largest (tag, probability) tuple
# at each position. Tuples compare by tag first, then by probability.
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0]
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0]
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0]
# Index of the first document that holds each centroid value.
c1_centroid = [i for i,j in enumerate(instream) if j[0] == c1_centroid_value][0]
c2_centroid = [i for i,j in enumerate(instream) if j[1] == c2_centroid_value][0]
c3_centroid = [i for i,j in enumerate(instream) if j[2] == c3_centroid_value][0]
print(instream[c1_centroid])
print(instream[c2_centroid])
# This prints c2_centroid a second time; c3_centroid was probably intended.
print(instream[c2_centroid])
[out] (top element of each cluster):
[(0, 0.5), (1, 0.3), (2, 0.3)]
[(0, 0.3), (1, 0.7), (2, 0.2)]
[(0, 0.3), (1, 0.7), (2, 0.2)]
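The "top document per tag" lookup can also be written more directly with `max` and a key function. This is only a sketch: `top_doc_for_tag` is a hypothetical helper, and it assumes the tuple for tag `t` sits at index `t` of every document. It returns one representative document per tag, not a clustering.

```python
def top_doc_for_tag(docs, tag):
    """Return the document with the highest probability for `tag`.

    Assumes doc[tag] is the (tag, probability) tuple for that tag.
    On ties, max() returns the first such document.
    """
    return max(docs, key=lambda doc: doc[tag][1])


instream = [
    [(0, 0.3), (1, 0.5), (2, 0.1)],
    [(0, 0.5), (1, 0.3), (2, 0.3)],
    [(0, 0.4), (1, 0.4), (2, 0.5)],
    [(0, 0.3), (1, 0.7), (2, 0.2)],
    [(0, 0.2), (1, 0.6), (2, 0.1)],
]

# One top document for each of the three tags.
tops = [top_doc_for_tag(instream, tag) for tag in range(3)]
for doc in tops:
    print(doc)
```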
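It would help if you could show some example input/output. Please explain in more detail what exactly you are doing, and also make sure this is not an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). –
@InbarRose, I edited the question to give more context. – alvas
Shouldn't the third line of your output be `[(0,0.4),(1,0.4),(2,0.5)]`? –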