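Filter multiple R data.table columns to eliminate outliers
I want to remove outliers that lie more than 2 standard deviations above or below the mean, for many variables with similar names (too many to specify each one individually in code).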
library(data.table)
irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value=TRUE)
# This works if I specify one column,
# but I have too many columns to specify, so need to use grep approach.
irisdt[, Sepal.Length.Outlier := (scale(Sepal.Length) < -2 | scale(Sepal.Length) > 2)]
# This does not work
irisdt[, (myCols) := lapply(myCols, function(x) {(scale(x) < -2 | scale(x) > 2)})]
# This partially works, but changes in place
irisdt[, (myCols) := lapply(myCols, function(x) {(scale(irisdt[[x]]) < -2 | scale(irisdt[[x]]) > 2)})]
# How do I make new variables, for example "Sepal.Length.Outlier"?
myOutlierCols <- grep(".Outlier", colnames(irisdt), value=TRUE)
# How do I select rows matching multiple columns (&)?
irisdt[myOutlierCols=="FALSE"] # does not work
irisdt[, hasOutlier := lapply(myCols, myCols==TRUE)] # does not work
irisdt[hasOutlier=="FALSE"] # relies on line above, which doesn't work
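One possible way to get the new ".Outlier" columns and the row filter asked about above is sketched here. It assumes that passing the data through .SD with .SDcols (instead of passing column names to lapply) is acceptable; outlierCols, noOutlier and cleanIris are names made up for this illustration.

```r
library(data.table)

irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value = TRUE)

# Hypothetical names for the new flag columns, e.g. "Sepal.Length.Outlier"
outlierCols <- paste0(myCols, ".Outlier")

# lapply(myCols, ...) hands the function a column *name* (a string);
# lapply(.SD, ...) with .SDcols hands it the column *values* instead.
# as.vector() drops the one-column matrix shape that scale() returns.
irisdt[, (outlierCols) := lapply(.SD, function(x) abs(as.vector(scale(x))) > 2),
       .SDcols = myCols]

# A row is kept only if none of its flag columns is TRUE
noOutlier <- irisdt[, rowSums(.SD) == 0, .SDcols = outlierCols]
cleanIris <- irisdt[noOutlier]
```

Because the flags go into separate columns, the original Sepal values are left untouched, which avoids the "changes in place" problem mentioned above.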
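Perhaps a function could take a data.table column and strip out the values above or below a z-score cutoff. That function could then be used with lapply.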
# This does not work
removeOutliers <- function(myColumn, cutoff = 3) {
lapply(myColumn, function (x) {
if (scale(myColumn[[x]]) < -cutoff | scale(myColumn[[x]]) > cutoff) {
x <- NA #specify individual value instead of column?
}
})
}
removeOutliers(irisdt[,Sepal.Length]) # for testing
trimmedIrisdt <- irisdt[,lapply(.SD, removeOutliers(.SD)), .SDcols = myCols] # could do by = grouping variable
# Once outliers are made NA, this would work:
trimmedIrisdt <- complete.cases(trimmedIrisdt)
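A minimal sketch of how such a function might look, assuming it should work on one whole column at a time rather than value by value. naOutliers is a hypothetical name, and the cutoff of 2 is taken from the goal stated at the top of the question.

```r
library(data.table)

irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value = TRUE)

# Hypothetical helper: takes one numeric vector and returns it with
# every value beyond `cutoff` standard deviations replaced by NA.
naOutliers <- function(x, cutoff = 2) {
  z <- as.vector(scale(x))              # z-scores as a plain vector
  replace(x, which(abs(z) > cutoff), NA)
}

# Pass the function itself to lapply; .SD supplies each column in turn
trimmedIrisdt <- irisdt[, lapply(.SD, naOutliers), .SDcols = myCols]

# complete.cases() returns a logical vector, so use it to index the rows
trimmedIrisdt <- trimmedIrisdt[complete.cases(trimmedIrisdt)]
```

Adding by = Species to the first call would compute the z-scores within each group instead of over the whole table.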
Thank you for the very concise, clear answer. This is much better than the way I was trying to do it!
I tried to modify it to replace every value where abs(scale(x)) >= 2 with NA. Here is my attempt (it does not work): irisdt[, (myCols) := lapply(.SD, function(x) (if (as.logical(do.call(pmin, lapply(.SD, function(x) abs(scale(x)) <= 2)))) {NA} else {x})), .SDcols = myCols]
And this does not work for replacing individual cells either: irisdt[, (myCols) := lapply(.SD, function(x) {if (abs(scale(x)) <= 2) {x} else {NA}}), .SDcols = myCols]. Can you explain do.call(pmin, ...)?
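On the two points raised in these comments, here is a rough sketch rather than a definitive answer. do.call(pmin, someList) makes the same call as pmin(a, b, ...) with the list elements as arguments, that is, an element-wise minimum; on logical vectors the minimum is TRUE only where every vector is TRUE, so it acts as a row-wise AND. Separately, if () expects a single TRUE or FALSE, so it cannot test a whole column; ifelse() is the vectorised form. within2sd, allWithin and irisNA are names invented for this example.

```r
library(data.table)

irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value = TRUE)

# One logical vector per column: TRUE where the value is within 2 SDs
within2sd <- lapply(myCols, function(col) {
  abs(as.vector(scale(irisdt[[col]]))) <= 2
})

# Element-wise minimum across the vectors; as.logical() guards against
# the result coming back as 0/1 rather than FALSE/TRUE
allWithin <- as.logical(do.call(pmin, within2sd))

# Replacing single cells: ifelse() decides value by value
irisNA <- copy(irisdt)  # copy() so irisdt itself is not modified by :=
irisNA[, (myCols) := lapply(.SD, function(x) {
  ifelse(abs(as.vector(scale(x))) <= 2, x, NA_real_)
}), .SDcols = myCols]
```

Here allWithin marks the rows with no outlier in any Sepal column, while irisNA keeps every row and blanks only the offending cells.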