2013-04-18 17 views
2

How to detect outliers in the columns of a data frame in R?

I have a data frame, say like this:

names<-c("a","a","a","a","a","b","b","b","b","b","c","c","c","c","c","c","c","c") 
var1<-c(0.942999593,0.935507266,0.973589623,0.969415912,0.95230801,0.935507266,0.888740961,0.91750551,0.944482672,0.945468585,1.457579147,0.922206277,0.941511433,0.954724791,0.941014244,0.941511433,0.941511433,1.50511433) 
var2<-c(-0.012678088,0.014313763,0.001138275,-0.020568206,0.012987126,0.001217192,0.03360358,0.009758172,0.015066932,-0.037879492,0.020471157,0.010738162,0.010952531,0.019377213,0.027140572,0.031116892,-0.018530676,-8.90E-05) 
# data.frame() keeps var1 and var2 numeric; cbind() would coerce every column to character
df <- data.frame(names, var1, var2)

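I want to convert the outliers in the columns var1 and var2 to NA. However, I want the outliers to be computed independently for each category in the "names" column. So the outliers for "a" in var1 would be the outliers among the first 5 rows of var1.

The way I detect outliers is: all values that fall below the 0.25 quantile or above the 0.75 quantile.

Is there an easy way to do this in R?

Thanks a lot in advance.

Tina.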

+1

You can use 'split()' on your 'df' by the variable 'names' and detect your outliers (however you define them). –
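
A minimal sketch of what the split() suggestion could look like, assuming the outlier rule from the question (values below the 25% or above the 75% quantile of their own group). The toy values and the name var1_clean are hypothetical and only serve the illustration:

```r
# Toy data (hypothetical values), two groups of five rows each
names <- c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b")
var1  <- c(1, 2, 3, 4, 100, 10, 20, 30, 40, 50)

# split() breaks var1 into one vector per level of names
groups <- split(var1, names)

# Within each group, replace values outside the 25%-75% quantile range with NA
cleaned <- lapply(groups, function(x) {
  q <- quantile(x, c(0.25, 0.75))
  replace(x, x < q[1] | x > q[2], NA)
})

# unsplit() puts the pieces back in the original row order
var1_clean <- unsplit(cleaned, names)
var1_clean
```

The same pattern should work on a whole data frame by splitting df instead of a single vector, at the cost of a little more bookkeeping.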

+1

@JuliánUrbano: How would it not? She asked for quantile values, not absolute values –

+0

@CarlWitthoft right..didn't read that right;) –

Answers

4

Here is how you can do it for var1:

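# Quantiles of var1 within each level of names (a list indexed by "a", "b", "c")
quantiles <- tapply(var1, names, quantile)
# Look up each row's group-specific 25% and 75% quantiles
minq <- sapply(names, function(x) quantiles[[x]]["25%"])
maxq <- sapply(names, function(x) quantiles[[x]]["75%"])
# Set values outside that range to NA
var1[var1 < minq | var1 > maxq] <- NA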

Repeat the same for var2 (or df$var2).
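
To avoid repeating the steps by hand, here is a minimal sketch that handles both columns in one loop. It assumes the data frame holds numeric var1 and var2 columns; the helper name mask_outside_iqr and the toy values are hypothetical, and ave() is swapped in for the tapply()/sapply() lookup used above:

```r
# Toy data (hypothetical values), two groups of five rows each
df <- data.frame(
  names = rep(c("a", "b"), each = 5),
  var1  = c(1, 2, 3, 4, 100, 10, 20, 30, 40, 50),
  var2  = c(5, 4, 3, 2, 1, 0, 0, 0, 0, 9)
)

# Hypothetical helper: set values below the 25% or above the 75% quantile to NA
mask_outside_iqr <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  x[x < q[1] | x > q[2]] <- NA
  x
}

# ave() applies the helper within each level of df$names and
# returns the results in the original row order
for (col in c("var1", "var2")) {
  df[[col]] <- ave(df[[col]], df$names, FUN = mask_outside_iqr)
}
df
```

Because ave() returns a vector aligned with the original rows, the result can be assigned straight back into the data frame without any reordering.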

+1

That is what I was going to say, but this is cleaner (note the convenient way 'minq' matches up the levels). –

+0

Thanks a lot! – user18441