Outlier detection for a multi-column data frame in R

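I have a data frame with 18 columns and about 12,000 rows. I want to find outliers in the first 17 columns and compare the results with column 18. Column 18 is a factor and contains data that can be used as an outlier indicator.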

My data frame is ufo, and I remove column 18 as follows:

ufo2 <- ufo[,1:17]

Then I convert the three non-numeric columns to numeric:

ufo2$Weight <- as.numeric(ufo2$Weight) 
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue) 
ufo2$Score <- as.numeric(ufo2$Score)

Then I run outlier detection with the following command:

outlier.scores <- lofactor(ufo2, k=5)

However, every element of outlier.scores is NA!

Is there a mistake in this code?

Is there another way to find outliers in a data frame like this?

All of my code:

setwd(datadirectory) 
library(doMC) 
registerDoMC(cores=8) 

library(DMwR) 

# load data 
load("data_9802-f2.RData") 

ufo2 <- ufo[,2:17] 

ufo2$Weight <- as.numeric(ufo2$Weight) 
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue) 
ufo2$Score <- as.numeric(ufo2$Score) 

outlier.scores <- lofactor(ufo2, k=5)

The output of dput(head(ufo2)) is:

structure(list(Origin = c(2L, 2L, 2L, 2L, 2L, 2L), IO = c(2L, 
2L, 2L, 2L, 2L, 2L), Lot = c(1003L, 1003L, 1003L, 1012L, 1012L, 
1013L), DocNumber = c(10069L, 10069L, 10087L, 10355L, 10355L, 
10382L), OperatorID = c(5698L, 5698L, 2015L, 246L, 246L, 4135L 
), Month = c(1L, 1L, 1L, 1L, 1L, 1L), LineNo = c(1L, 2L, 1L, 
1L, 2L, 1L), Country = c(1L, 1L, 1L, 1L, 11L, 1L), ProduceCode = c(63456227L, 
63455714L, 33687427L, 32686627L, 32686627L, 791614L), Weight = c(900, 
850, 483, 110000, 5900, 1000), InvoiceValue = c(637, 775, 2896, 
48812, 1459, 77), InvoiceValueWeight = c(707L, 912L, 5995L, 444L, 
247L, 77L), AvgWeightMonth = c(1194.53, 1175.53, 7607.17, 311.667, 
311.667, 363.526), SDWeightMonth = c(864.931, 780.247, 3442.93, 
93.5818, 93.5818, 326.238), Score = c(0.56366535234262, 0.33775439984787, 
0.46825476121676, 1.414092583904, 0.69101737288291, 0.87827342721894 
), TransactionNo = c(47L, 47L, 6L, 3L, 3L, 57L)), .Names = c("Origin", 
"IO", "Lot", "DocNumber", "OperatorID", "Month", "LineNo", "Country", 
"ProduceCode", "Weight", "InvoiceValue", "InvoiceValueWeight", 
"AvgWeightMonth", "SDWeightMonth", "Score", "TransactionNo"), row.names = c(NA, 
6L), class = "data.frame")

Source

2013-10-02 Mohammad

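Hi, welcome to Stack Overflow! You are more likely to receive an answer if you provide a minimal, reproducible dataset (http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) along with the code you have tried. Thanks! – Henrik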

Thanks! The data set is quite large, but my code is the same as above. Do you still need it to answer my question? – Mohammad

A sample of your data (provided with 'dput(head(ufo2))') and the packages you have loaded. Just a guess: have you looked at your data after using 'as.numeric'? – Roland
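Roland's guess can be illustrated with a small sketch. The values and variable names below are made up for illustration and are not taken from the asker's data, and the post does not show whether this is what produced the NA scores. The point is only that as.numeric on a factor returns the internal level codes, and as.numeric on a character vector returns NA for entries that are not numbers, so it is worth inspecting the columns after conversion.

```r
# Hypothetical example values, not from the original data set.
f <- factor(c("900", "850", "110000"))

codes  <- as.numeric(f)                # internal level codes, not the printed values
values <- as.numeric(as.character(f))  # the numbers the labels actually spell out

# A character column with a non-numeric entry becomes NA at that position.
x <- c("637", "n/a", "77")
converted <- suppressWarnings(as.numeric(x))

# Count missing values per column before running any distance-based method.
df <- data.frame(Weight = values, InvoiceValue = converted)
na_counts <- colSums(is.na(df))
print(na_counts)
```

If na_counts shows missing values, those rows need to be dropped or imputed first, since distances involving NA are not defined.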

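First, you need to spend more time preprocessing your data. Your axes have completely different meanings and scales. If you are not careful, the outlier detection results will be meaningless, because they are based on a meaningless distance. E.g. are you sure this should be part of your similarity?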
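To make the point about scale concrete, here is a minimal sketch using only base R (dist and scale) on the Weight and Score values from the dput output in the question, with Score rounded. Standardizing with scale is one common preprocessing choice and is an assumption here, not something the answer prescribes.

```r
# Weight and Score from the question's dput output (Score rounded).
df <- data.frame(
  Weight = c(900, 850, 483, 110000, 5900, 1000),
  Score  = c(0.56, 0.34, 0.47, 1.41, 0.69, 0.88)
)

# Euclidean distances on the raw columns are dominated by Weight:
# they are almost identical to the distances computed from Weight alone.
raw_d    <- as.matrix(dist(df))
weight_d <- as.matrix(dist(df$Weight))
max_diff <- max(abs(raw_d - weight_d))

# Standardize each column to mean 0 and standard deviation 1,
# so that both columns contribute to the distance.
scaled   <- scale(df)
scaled_d <- as.matrix(dist(scaled))
```

Whether identifier-like columns such as DocNumber or OperatorID belong in the distance at all is a separate question that scaling does not answer.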

Also note that I found the lofactor implementation to be very slow. It also seems to be hard-wired to Euclidean distance!

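Instead, I recommend using ELKI for outlier detection. First, it has a much wider choice of algorithms; second, it is much faster than R; and third, it is very modular and flexible. For your use case, you may need to implement a custom distance function instead of using Euclidean distance.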
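For readers who want to stay in R, the idea of running LOF on a distance other than Euclidean can be sketched in base R. This is not ELKI and not the DMwR code: lof_from_dist is a hypothetical helper written for illustration. It takes any precomputed distance matrix, uses exactly k neighbours per point (ties broken by order, a simplification of the original LOF definition), and is quadratic in memory, so it is only meant to show the mechanics.

```r
# Hypothetical helper: LOF scores from a precomputed distance matrix.
lof_from_dist <- function(d, k) {
  d <- as.matrix(d)
  n <- nrow(d)
  stopifnot(k >= 1, k < n)
  diag(d) <- Inf  # a point is not its own neighbour

  # Column i holds the indices of the k nearest neighbours of point i.
  nn <- apply(d, 1, function(row) order(row)[seq_len(k)])
  nn <- matrix(nn, nrow = k)  # keep the matrix shape when k == 1

  # k-distance: distance from each point to its k-th nearest neighbour.
  kdist <- vapply(seq_len(n), function(i) d[i, nn[k, i]], numeric(1))

  # Local reachability density: inverse of the mean reachability distance,
  # where reach-dist(p, o) = max(k-distance(o), d(p, o)).
  lrd <- vapply(seq_len(n), function(i) {
    o <- nn[, i]
    1 / mean(pmax(kdist[o], d[i, o]))
  }, numeric(1))

  # LOF: mean density of the neighbours relative to the point's own density.
  vapply(seq_len(n), function(i) mean(lrd[nn[, i]]) / lrd[i], numeric(1))
}

# Usage with a non-Euclidean distance (Manhattan) on toy data.
x <- c(0, 1, 2, 3, 4, 50)
scores <- lof_from_dist(dist(x, method = "manhattan"), k = 2)
print(which.max(scores))  # index of the point with the highest LOF score
```

In this sketch, duplicated rows can give a reachability distance of zero and therefore infinite or NaN scores, so duplicates should be handled before scoring.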

Here is a link to the ELKI tutorial on implementing a custom distance function.

Source

2013-10-03 12:54:39
