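How to revert one-hot encoding in Spark (Scala)
After running k-means (mllib, Spark, Scala), I want to make sense of the cluster centers I obtained from data that was preprocessed with, among other transformers, mllib's OneHotEncoder.
A center looks like this: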
集羣中心0 0.3496378699559276,0.05482645034473324,111.6962521358467,1.770525792286651,0.0,0.8561916265130964,0.014382183950365071,0.0,0.0,0.0,0.47699722692567864,0.0,0.0,0.0,0.04988557988346689,0.0, 0.0,0.0,0.8981811028926263,0.9695107580117296,0.0,0.0,1.7505886931570156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.771620072281845,0.0,0.0,0.0,0.0]
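This is obviously not very human-friendly. Any ideas on how to revert the one-hot encoding and retrieve the original categorical features? What if I look for the data point closest to the centroid (using the same distance metric that k-means uses, which I assume is Euclidean distance) and then revert the encoding of that particular data point?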
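For illustration, the "closest data point to each centroid" idea could be sketched roughly as follows. This assumes the RDD-based mllib API, the default Euclidean distance, and that every encoded vector is already paired with an ID; the helper name `closestPointIds` and the `encoded` RDD are hypothetical, not part of the original setup.

```scala
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical helper: for each cluster index, return the ID of the
// encoded point that lies closest to that cluster's centroid.
// Assumes `encoded` holds (id, featureVector) pairs fed to k-means.
def closestPointIds(model: KMeansModel,
                    encoded: RDD[(Long, Vector)]): Map[Int, Long] = {
  val centers = model.clusterCenters // Array[Vector], one per cluster
  encoded
    .map { case (id, v) =>
      val k = model.predict(v) // index of the nearest centroid
      // squared Euclidean distance preserves the ordering of distances
      (k, (id, Vectors.sqdist(v, centers(k))))
    }
    // keep the point with the smallest distance for each cluster
    .reduceByKey((a, b) => if (a._2 <= b._2) a else b)
    .mapValues(_._1)
    .collect()
    .toMap
}
```

The returned IDs can then be looked up in the un-encoded data, which avoids having to decode the one-hot slices of that point at all.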
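Thanks! I understand your answer. What if I look for the data point closest to the centroid (using the same distance metric that k-means uses, which I assume is Euclidean distance) and then revert the encoding of that particular data point?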
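@JoãoMoura In that case, I think the simplest approach is to keep an ID on every data point and, once a point has been assigned to its cluster, retrieve the original values by that ID. You then don't need to revert the encoding at all; a simple select/join between the original and the encoded dataset is enough.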
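A minimal sketch of that ID-based join, assuming the RDD-based mllib API and that the same IDs were attached to the rows before encoding (for example with `zipWithUniqueId`). The names `attachClusters`, `raw` and `encoded` are hypothetical and only serve the illustration.

```scala
import scala.reflect.ClassTag
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: pair every original (un-encoded) record with the
// cluster index assigned to its encoded counterpart, matching on the ID.
def attachClusters[T: ClassTag](
    model: KMeansModel,
    raw: RDD[(Long, T)],          // (id, original record)
    encoded: RDD[(Long, Vector)]  // (id, one-hot encoded features)
): RDD[(Long, (Int, T))] = {
  // assign each encoded point to its nearest centroid
  val assignments: RDD[(Long, Int)] = encoded.mapValues(v => model.predict(v))
  // join on the shared ID to get back the human-readable values
  assignments.join(raw)
}
```

Grouping the result by cluster index then lets you inspect the original categorical values per cluster directly, for instance by counting how often each category occurs.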