Why does kNN classification accuracy drop so much after feature normalization?

I am doing kNN classification on some data, with a random 80/20 train/test split. My data looks like this:

[ [1.0, 1.52101, 13.64, 4.49, 1.1, 71.78, 0.06, 8.75, 0.0, 0.0, 1.0], 
    [2.0, 1.51761, 13.89, 3.6, 1.36, 72.73, 0.48, 7.83, 0.0, 0.0, 2.0], 
    [3.0, 1.51618, 13.53, 3.55, 1.54, 72.99, 0.39, 7.78, 0.0, 0.0, 3.0], 
    ... 
] 

The items in the last column of the matrix are the classes: 1.0, 2.0 and 3.0.

My data after feature normalization looks like this:

[[-0.5036443480260487, -0.03450760227559746, 0.06723230162846759, 0.23028986544844693, -0.025324623254270005, 0.010553065215338569, 0.0015136367098358505, -0.11291235596166802, -0.05819669234942126, -0.12069793876044387, 1.0], 
[-0.4989050339943617, -0.11566537753097901, 0.010637426608816412, 0.2175704556290625, 0.03073267976659575, 0.05764598316498372, -0.012976783512350588, -0.11815839520204152, -0.05819669234942126, -0.12069793876044387, 2.0], 
... 
] 

The formula I used for normalization:

(X - avg(X))/(max(X) - min(X)) 
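In code, applying that formula column by column might look like the following minimal NumPy sketch (my own illustration, not the asker's actual code; it leaves the class column unscaled):

    import numpy as np

    def mean_normalize(data):
        """Apply (X - avg(X)) / (max(X) - min(X)) to each feature column."""
        data = np.asarray(data, dtype=float)
        features, labels = data[:, :-1], data[:, -1]  # last column is the class
        spread = features.max(axis=0) - features.min(axis=0)  # assumes no constant column
        scaled = (features - features.mean(axis=0)) / spread
        return np.column_stack([scaled, labels])  # class labels stay unscaled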

I run the kNN classification for each K = 1 to 25 (odd values only), recording the average accuracy for each K. Here are my results:

Average accuracy for K=1 after 100 tests with different data split: 98.91313003886198 % 
Average accuracy for K=3 after 100 tests with different data split: 98.11976006170633 %  
Average accuracy for K=5 after 100 tests with different data split: 97.71226079929019 % 
Average accuracy for K=7 after 100 tests with different data split: 97.47493145754373 %  
Average accuracy for K=9 after 100 tests with different data split: 97.16596220947888 % 
Average accuracy for K=11 after 100 tests with different data split: 96.81465365733266 % 
Average accuracy for K=13 after 100 tests with different data split: 95.78772655522567 %  
Average accuracy for K=15 after 100 tests with different data split: 95.23116406332706 %  
Average accuracy for K=17 after 100 tests with different data split: 94.52371789094929 %  
Average accuracy for K=19 after 100 tests with different data split: 93.85285871435981 % 
Average accuracy for K=21 after 100 tests with different data split: 93.26620809747965 %  
Average accuracy for K=23 after 100 tests with different data split: 92.58047022661833 % 
Average accuracy for K=25 after 100 tests with different data split: 90.55746523509124 % 

But when I apply feature normalization, the accuracy drops significantly. My kNN results with normalized features:

Average accuracy for K=1 after 100 tests with different data split: 88.56128075154439 % 
Average accuracy for K=3 after 100 tests with different data split: 85.01466511662318 %  
Average accuracy for K=5 after 100 tests with different data split: 83.32096281613967 %  
Average accuracy for K=7 after 100 tests with different data split: 83.09434478900455 % 
Average accuracy for K=9 after 100 tests with different data split: 82.05628926919964 % 
Average accuracy for K=11 after 100 tests with different data split: 79.89732262550343 % 
Average accuracy for K=13 after 100 tests with different data split: 79.60617886853211 %  
Average accuracy for K=15 after 100 tests with different data split: 79.26511126374507 %  
Average accuracy for K=17 after 100 tests with different data split: 77.51457877706329 % 
Average accuracy for K=19 after 100 tests with different data split: 76.97848441605367 %  
Average accuracy for K=21 after 100 tests with different data split: 75.70005919265326 %  
Average accuracy for K=23 after 100 tests with different data split: 76.45758217099551 % 
Average accuracy for K=25 after 100 tests with different data split: 76.16619492431572 % 

There is no logic error in the algorithm in my code; I have checked it on simple data.
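For reference, a minimal sketch of the evaluation procedure described above, assuming scikit-learn's KNeighborsClassifier and train_test_split (my assumption; the asker's actual code is not shown):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    def average_accuracy(X, y, k, n_tests=100):
        """Mean kNN test accuracy over repeated random 80/20 splits."""
        scores = []
        for seed in range(n_tests):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.2, random_state=seed)
            model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
            scores.append(model.score(X_te, y_te))
        return 100 * np.mean(scores)

    # K = 1 to 25, odd values only, as in the experiment above:
    # for k in range(1, 26, 2):
    #     print(f"Average accuracy for K={k}: {average_accuracy(X, y, k)} %")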


Why does the kNN classification accuracy drop so much after feature normalization? I would think that normalization by itself should not decrease the accuracy of any classifier. So what is the purpose of feature normalization?

Answers

Answer 1 (3 votes)

It is a common misconception that normalization will never decrease classification accuracy. It very well can.

How?

The relative values of the features matter too; indeed, they determine the placement of points in the feature space. When you normalize, you can seriously offset that relative placement. This is felt particularly in k-NN classification, because it operates directly on the distances between points. The effect is not as strongly felt in SVMs, by comparison, because there the optimization process is still able to find a reasonably accurate hyperplane.

You should also note that you are normalizing using avg(X) here. So consider two points in adjacent columns of a particular row. If the first point is far below the mean and the second far above the mean of their respective columns, then even though they are very close values in the non-normalized sense, the distance computation can differ greatly.
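A tiny numeric sketch of this effect, with hypothetical values plugged into the asker's (X - avg(X)) / (max(X) - min(X)) formula:

    import numpy as np

    # Two hypothetical feature columns: one with a tiny spread, one with a large one.
    X = np.array([
        [1.50, 100.0],
        [1.52, 300.0],
        [1.51, 200.0],
    ])

    Xn = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # Raw: rows 0 and 1 differ by only 0.02 in column 1, so column 2 dominates.
    print(np.linalg.norm(X[0] - X[1]))    # ~200.0
    # Normalized: 1.50 sits at the bottom of column 1's range and 1.52 at the top,
    # so the same two rows are now a full unit apart in that dimension.
    print(np.linalg.norm(Xn[0] - Xn[1]))  # ~1.41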

Don't expect normalization to work wonders.

Answer 2 (2 votes)

KNN works by finding instances similar to the one in question; it computes the Euclidean distance between two points. By normalizing, you are changing the scale of the features, and that changes your accuracy.
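As an illustration of that scale sensitivity (hypothetical numbers, not the asker's data), rescaling a single feature can flip which point is the nearest neighbour:

    import numpy as np

    query = np.array([1.0, 10.0])
    a = np.array([1.1, 80.0])   # close to the query in feature 1, far in feature 2
    b = np.array([5.0, 12.0])   # far in feature 1, close in feature 2

    # On the raw scale, feature 2 dominates the Euclidean distance: b is nearest.
    print(np.linalg.norm(query - a), np.linalg.norm(query - b))  # ~70.0, ~4.47

    # Shrink feature 2 by a factor of 100 and the ranking flips: a is nearest.
    scale = np.array([1.0, 0.01])
    print(np.linalg.norm((query - a) * scale),
          np.linalg.norm((query - b) * scale))                   # ~0.71, ~4.00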

Have a look at this study. Go to the figures and you will find that different scaling techniques give different accuracies.