2017-07-31 92 views
0

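Filter multiple R data.table columns to eliminate outliers

I want to remove outliers that lie more than 2 standard deviations above or below the mean, for many variables with similar names (too many to specify individually in code).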

library(data.table) 

irisdt <- data.table(iris) 
myCols <- grep("Sepal", colnames(irisdt), value=TRUE) 

# This works if I specify one column, 
# but I have too many columns to specify, so need to use grep approach. 
irisdt[, Sepal.Length.Outlier := (scale(Sepal.Length) < -2 | scale(Sepal.Length) > 2)] 

# This does not work 
irisdt[, (myCols) := lapply(myCols, function(x) {(scale(x) < -2 | scale(x) > 2)})] 

# This partially works, but changes in place 
irisdt[, (myCols) := lapply(myCols, function(x) {(scale(irisdt[[x]]) < -2 | scale(irisdt[[x]]) > 2)})] 
# How do I make new variables, for example "Sepal.Length.Outlier"? 

myOutlierCols <- grep(".Outlier", colnames(irisdt), value=TRUE) 

# How do I select rows matching multiple columns (&)? 
irisdt[myOutlierCols=="FALSE"] # does not work 
irisdt[, hasOutlier := lapply(myCols, myCols==TRUE)] # does not work 
irisdt[hasOutlier=="FALSE"] # relies on line above, which doesn't work 

Perhaps a function could take a data.table column and strip out the values above or below a z-score cutoff. It could then be used with lapply.

# This does not work 
removeOutliers <- function(myColumn, cutoff = 3) { 
    lapply(myColumn, function (x) { 
    if (scale(myColumn[[x]]) < -cutoff | scale(myColumn[[x]]) > cutoff) { 
     x <- NA #specify individual value instead of column? 
    } 
    }) 
} 
removeOutliers(irisdt[,Sepal.Length]) # for testing 
trimmedIrisdt <- irisdt[,lapply(.SD, removeOutliers(.SD)), .SDcols = myCols] # could do by = grouping variable 

# Once outliers are made NA, this would work: 
trimmedIrisdt <- complete.cases(trimmedIrisdt) 

Answers

2

I guess this achieves the goal:

irisdt[, keep := 
    as.logical(do.call(pmin, lapply(.SD, function(x) abs(scale(x)) <= 2))) 
, .SDcols = myCols] 

res = irisdt[(keep), !"keep"] 

     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---
135:          6.7         3.0          5.2         2.3 virginica
136:          6.3         2.5          5.0         1.9 virginica
137:          6.5         3.0          5.2         2.0 virginica
138:          6.2         3.4          5.4         2.3 virginica
139:          5.9         3.0          5.1         1.8 virginica

This should also work if there is a grouping variable. I can't vouch for its statistical soundness.
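As a sketch of the grouped variant, assuming the z-scores should be computed within each level of a grouping column (here `Species`, chosen only for illustration; `res_by_group` is a made-up name), the same call can take `by`:

```r
library(data.table)

irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value = TRUE)

# scale() is now applied within each Species, so every value is judged
# against its own group's mean and standard deviation.
irisdt[, keep :=
    as.logical(do.call(pmin, lapply(.SD, function(x) abs(scale(x)) <= 2)))
, by = Species, .SDcols = myCols]

# Drop the flagged rows and the helper column.
res_by_group <- irisdt[(keep), !"keep"]
```

Because the cutoff is relative to each group, the rows removed this way will generally differ from those removed by the ungrouped call.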


How it works:

  1. Test each cell for abs(scale(x)) <= 2.
  2. Keep the row if the minimum result across the columns is TRUE, i.e. if every tested column passed.
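The two steps can be sketched on a toy example. The list `flags` and its entries `a` and `b` are hypothetical and written by hand rather than computed with `scale()`, just to show what `do.call(pmin, ...)` does with per-column TRUE/FALSE results:

```r
# Hypothetical per-column flags: TRUE means the cell is within the cutoff.
flags <- list(
  a = c(TRUE, TRUE, FALSE, TRUE),
  b = c(TRUE, FALSE, TRUE, TRUE)
)

# do.call(pmin, flags) makes the same call as pmin(a = flags$a, b = flags$b):
# the element-wise (row-wise) minimum across the list entries.
# Counting FALSE as 0 and TRUE as 1, the minimum is 1 only when
# every column is TRUE for that row.
keep <- as.logical(do.call(pmin, flags))
print(keep)

# The same row-wise "all TRUE" test written with a logical AND.
keep_and <- Reduce(`&`, flags)
print(identical(keep, keep_and))
```

Only rows 1 and 4 are kept here, since those are the only rows where both entries are TRUE. `Reduce` with `&` is an alternative spelling of the same test, not what the answer itself uses.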

To see how it works cell by cell...

library(data.table) 

mynewCols = paste0(myCols,"_outly") 
irisdt[, (mynewCols) := 
    lapply(.SD, function(x) replace(x, abs(scale(x)) <= 2, NA)) 
, .SDcols = myCols] 

Then browse the result with something like View(irisdt[rowSums(!is.na(irisdt[, ..mynewCols])) > 0])

+1

Thank you for the very concise, clear answer. This is much better than the way I was trying to do it!

+0

I tried to modify it to replace all values where abs(scale(x)) >= 2 with NA. This is my attempt (it doesn't work): irisdt[(myCols) := lapply(.SD, function(x) (if (as.logical(do.call(pmin, lapply(.SD, function(x) abs(scale(x)) <= 2)))) {NA} else {x})), .SDcols = myCols]

+0

And this doesn't work to replace the cells either: irisdt [(myCols): (x){if(abs(scale(x))<= 2){x} else {NA}}),.SDcols = myCols]. Can you explain do.call(pmin, ...)?
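One possible way to do what these comments ask, offered as a sketch rather than as the answerer's reply: `if` tests a single value, so it cannot decide cell by cell over a whole column, whereas a vectorized `replace()` can. The `_trim` suffix and the names `trimCols` and `trimmed` are made up for illustration.

```r
library(data.table)

irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value = TRUE)
trimCols <- paste0(myCols, "_trim")   # hypothetical names for the new columns

# Copy each column, setting cells more than 2 SDs from the column mean to NA.
# as.vector() drops the matrix shape that scale() returns.
irisdt[, (trimCols) :=
    lapply(.SD, function(x) replace(x, as.vector(abs(scale(x)) > 2), NA))
, .SDcols = myCols]

# Keep only the rows where none of the trimmed columns is NA.
trimmed <- irisdt[complete.cases(irisdt[, ..trimCols])]
print(nrow(trimmed))
```

Writing to new columns leaves the original values untouched; assigning to `(myCols)` instead would overwrite them in place.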