Python KMeans with the Orange framework

I plan to use Orange for k-means clustering. I have gone through the tutorials, but I still have a few questions:

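I am clustering high-dimensional vectors. 1) Is cosine distance implemented? 2) I don't want to assign zero to empty values. I tried leaving the empty fields blank, with no zeros in them, and got this error: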

SystemError: 'orange.TabDelimExampleGenerator': the number of attribute types does not match the number of attributes

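How do I indicate an empty value? 3) Is there a way to incorporate an "ID" into the example table? I want to tag my data with an ID (not a class) for easy reference. I don't want the ID column to count as part of the data proper.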

4) Is there a way to get the k-means clustering output in a different format? I would prefer something like this:

cluster1: [ <id1>, <id2>, ...] 
cluster2: [ <id3>, ... ] 
rather than just [1, 2, 3,1 , 2, ... ]

Thanks!

Source

2010-02-07 alskndalsnd

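Four questions in one question is pretty awkward. Why not ask one question per question? It's not as if it would cost you ;-). Anyway, regarding "how do I indicate an empty value?", see the docs on the value attribute of Orange.Value instances: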

If value is continuous or unknown, no descriptor is needed. For the latter, the result is a string '?', '~' or '.' for don't know, don't care and other, respectively.

I don't know whether by empty you mean "don't know" or "don't care", but either way you can indicate it.
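As a rough sketch of what that could look like when writing the data file: the helper below (`format_row` is a hypothetical name, not part of Orange) writes '?' in place of a missing field instead of leaving it empty. It assumes the tab-delimited loader accepts the same '?' symbol that the quoted docs describe for unknown values, which is worth checking against your Orange version.

```python
def format_row(values, unknown="?"):
    """Join one data row with tabs, writing `unknown` for missing (None) fields."""
    return "\t".join(unknown if v is None else str(v) for v in values)


# Hypothetical data: None marks a field with no known value.
rows = [
    [0.5, None, 1.25],
    [None, 2.0, 0.0],
]
lines = [format_row(r) for r in rows]
```

Every row keeps the same number of tab-separated fields this way, so no column silently disappears from a line.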

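As for distances, this other page in the docs says:

Unknown values are treated correctly only by Euclidean and Relief distance. For other measure of distance, a distance between unknown and known or between two unknown values is always 0.5.

The distances listed on that page are Hamming, Maximal, Manhattan, Euclidean and Relief (the last is like Manhattan, but with correct treatment of unknown values). No cosine distance is provided, so you will have to code it yourself.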
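The cosine distance itself is short to write by hand. The following is a minimal plain-Python sketch that does not use any Orange API; `cosine_distance` is a name chosen here for illustration, and how to plug such a function into Orange's k-means is not covered by the docs quoted above, so that part is left out.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

The result ranges from 0 for vectors pointing the same way to 2 for vectors pointing in opposite directions. Missing values would need their own handling before calling it.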

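For (4), a little Python code is all you need to format the results any way you want. The .clusters attribute of a KMeans object is a list exactly as long as the number of data instances. If what you want is a list of lists of data instances, for example: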

import orange

def loldikm(data, **k):
    # Run k-means, then bucket the data instances by their cluster index.
    km = orange.KMeans(data, **k)
    results = [[] for _ in km.centroids]
    for i, d in zip(km.clusters, data):
        results[i].append(d)
    return results
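The same grouping idea gives the "cluster: [ids]" layout asked for in the question. The sketch below is plain Python and makes no Orange calls: it assumes you already have the per-instance cluster labels (such as the .clusters list) and a parallel list of your own IDs. `group_ids_by_cluster` and the sample values are hypothetical.

```python
from collections import defaultdict


def group_ids_by_cluster(labels, ids):
    """Map each cluster label to the list of IDs assigned to it."""
    groups = defaultdict(list)
    for label, item_id in zip(labels, ids):
        groups[label].append(item_id)
    return dict(groups)


# Hypothetical labels and IDs, one of each per data instance, in the same order.
labels = [1, 2, 3, 1, 2]
ids = ["id1", "id2", "id3", "id4", "id5"]

groups = group_ids_by_cluster(labels, ids)
for label in sorted(groups):
    print("cluster%d: %s" % (label, groups[label]))
```

Keeping the IDs in a separate list alongside the data is one way to reference instances without the ID becoming a feature used in the distance computation.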

Source

2010-02-07 17:02:51

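I think the original k-means is not well suited to cosine distance. Because cosine distance is not a distance in Euclidean space, you would need to define what a centroid means under it, and convergence is not guaranteed. If your feature vectors are all positive, though, you can give it a try. More information: Add API for user defined distance function in k-means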
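One common workaround, not mentioned in the answer above and offered only as a sketch, is to scale every vector to unit length before clustering. For unit vectors the squared Euclidean distance equals 2 × (1 − cosine similarity), so Euclidean distance then orders pairs of points the same way cosine distance would. The helper names and the sample vectors below are hypothetical.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


a, b = [3.0, 4.0], [4.0, 3.0]
lhs = squared_euclidean(normalize(a), normalize(b))
rhs = 2.0 * (1.0 - cosine_similarity(a, b))
# lhs and rhs agree up to floating-point rounding.
```

The cluster means of unit vectors are generally not unit length themselves, so this approximates a cosine-based k-means rather than reproducing one exactly.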

Source

2015-01-20 03:37:04
