
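How to reverse one-hot encoding in Spark (Scala)

After running k-means (MLlib, Spark, Scala), I want to make sense of the cluster centers, which were computed from data preprocessed with MLlib's OneHotEncoder (among other transformers).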

The centers look like this:

Cluster center 0: [0.3496378699559276,0.05482645034473324,111.6962521358467,1.770525792286651,0.0,0.8561916265130964,0.014382183950365071,0.0,0.0,0.0,0.47699722692567864,0.0,0.0,0.0,0.04988557988346689,0.0, 0.0,0.0,0.8981811028926263,0.9695107580117296,0.0,0.0,1.7505886931570156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.771620072281845,0.0,0.0,0.0,0.0]

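This is obviously not very human-friendly. Any ideas on how to reverse the one-hot encoding and retrieve the original categorical features? What if I look up the data point closest to the centroid (using the same distance metric that k-means uses, which I assume is Euclidean distance) and then reverse the encoding of that particular data point?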

Answer


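For the cluster centroids, reversing the encoding is not possible (and I would strongly advise against trying). Imagine the original feature has 6 possible values and a data point has value "3", which is encoded as [0.0,0.0,1.0,0.0,0.0,0.0]. In this case it is easy to extract 3 as the correct value from the encoding.

But after running k-means, the cluster centroid may look like [0.0,0.13,0.0,0.77,0.1,0.0] for this feature. If you decode this back to the earlier representation, for example as "4" out of 6 because position 4 has the largest value, you lose information and the model may be corrupted.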
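To make the information loss concrete, here is a minimal plain-Scala sketch (no Spark required) using the two vectors from above. The object and the `argmaxDecode` helper are hypothetical names introduced for illustration, and categories are numbered from 1 to match the "3 out of 6" wording.

```scala
object CentroidDecodeSketch {
  // Hypothetical helper: decode a slice by taking the position of its
  // largest value. Positions are 1-based to match the text above.
  def argmaxDecode(slice: Seq[Double]): Int =
    slice.indexOf(slice.max) + 1

  // The two example vectors from the answer
  val encodedPoint: Seq[Double]  = Seq(0.0, 0.0, 1.0, 0.0, 0.0, 0.0)
  val centroidSlice: Seq[Double] = Seq(0.0, 0.13, 0.0, 0.77, 0.1, 0.0)

  def main(args: Array[String]): Unit = {
    // A real one-hot vector decodes without ambiguity
    println(argmaxDecode(encodedPoint))  // prints 3

    // The centroid also yields a single answer ...
    println(argmaxDecode(centroidSlice)) // prints 4

    // ... but the weight on the other positions is silently dropped
    val discarded = centroidSlice.sum - centroidSlice.max
    println(s"weight discarded by argmax: $discarded")
  }
}
```

Because a k-means centroid is the mean of the points assigned to it, each position in a one-hot slice of the centroid is the fraction of cluster members that had that category. Reading the slice as "77% value 4, 13% value 2, 10% value 5" keeps what the argmax throws away.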

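Edit: adding to the answer a workable way to recover the original values for data points, taken from the comments.

If you have IDs on your data points, you can assign each data point to a cluster and then do a select/join on the ID to get back its state from before the encoding.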
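The ID-based lookup can be sketched with plain Scala collections. Everything here is assumed for illustration: the `RawRow` and `EncodedRow` types, the sample values, and the centroid are made up, and Euclidean distance is the metric the question assumes k-means uses. In Spark the same idea would be a join on the ID column between the raw and the encoded dataset.

```scala
object NearestPointSketch {
  // Hypothetical record types: `id` links an encoded vector to its raw row
  final case class RawRow(id: Long, category: String, amount: Double)
  final case class EncodedRow(id: Long, features: Vector[Double])

  // Made-up raw data and its encoding: one-hot(category) followed by amount
  val raw: Seq[RawRow] = Seq(
    RawRow(1L, "red", 10.0),
    RawRow(2L, "green", 12.0),
    RawRow(3L, "blue", 50.0))
  val encoded: Seq[EncodedRow] = Seq(
    EncodedRow(1L, Vector(1.0, 0.0, 0.0, 10.0)),
    EncodedRow(2L, Vector(0.0, 1.0, 0.0, 12.0)),
    EncodedRow(3L, Vector(0.0, 0.0, 1.0, 50.0)))

  // Made-up centroid in the encoded feature space
  val centroid: Vector[Double] = Vector(0.4, 0.6, 0.0, 11.3)

  // Squared Euclidean distance; it has the same minimizer as the distance itself
  def squaredDistance(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // ID of the encoded row closest to the given centroid
  def nearestId(center: Seq[Double], rows: Seq[EncodedRow]): Long =
    rows.minBy(r => squaredDistance(r.features, center)).id

  // "Join" on the ID: look the raw row up instead of decoding the vector
  def representative(center: Seq[Double]): RawRow = {
    val rawById = raw.map(r => r.id -> r).toMap
    rawById(nearestId(center, encoded))
  }

  def main(args: Array[String]): Unit =
    println(representative(centroid))
}
```

Nothing is decoded here: the encoded vectors are used only to find the nearest point, and the human-readable values come straight from the untouched raw rows.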

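Thanks! I understand your answer. What if I look up the data point closest to the centroid (using the same distance metric that k-means uses, which I assume is Euclidean distance) and then reverse the encoding of that particular data point?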

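@JoãoMoura In that case, I think the easiest approach is to have an ID on every data point and, once a point has been assigned to its cluster, retrieve the original values by ID. Then you do not need to reverse the encoding at all; you just do a simple select/join between the original dataset and the encoded dataset.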