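R: per-group statistics from a data frame

Apologies if this is a duplicate; I really don't know the correct term for what I'm trying to achieve.

I have a data frame of lab results for drugs, as follows: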

 
╔══════╦════════╗
║ drug ║ result ║
╠══════╬════════╣
║ A    ║     10 ║
║ B    ║    150 ║
║ B    ║     50 ║
║ A    ║     14 ║
║ C    ║      3 ║
║ C    ║      7 ║
╚══════╩════════╝

For each drug, I use dplyr to remove outliers (results more than 4 SD from the mean) with the following:

cleaned <- data %>% group_by(drug) %>% filter(abs(result-mean(result))/sd(result) < 4)

But now I want to know how many outliers were removed per drug, so basically I want to generate a data frame that looks like this:

 
╔══════╦═══════════╦══════════╦════════════╗
║ drug ║ total (N) ║ outliers ║ % outliers ║
╠══════╬═══════════╬══════════╬════════════╣
║ A    ║       100 ║        7 ║ 0.07       ║
║ B    ║       200 ║       45 ║ 0.225      ║
║ C    ║       300 ║       99 ║ 0.33       ║
╚══════╩═══════════╩══════════╩════════════╝

What is the best way to do this?

2015-09-29 Alexander David

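Since there was no sample data, I decided to use the mtcars dataset for demonstration. If I follow your approach, the following would be one way to do it. You want to find the part of the data that you filtered out, so you use setdiff() to collect it. Since am is the grouping variable in this demonstration, use count() to find how many outliers each group has (i.e. for am equal to 0 or 1). You then use select() and unlist() to get the vector you need. Finally, use summarise() to count how many data points exist for each value of am, and use mutate() to add the new columns.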

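library(dplyr)
library(tidyr)

# Keep the rows within 1 SD of the group mean of disp, then use
# setdiff() to recover the rows that the filter dropped (the outliers)
# and count them for each value of am.
mtcars %>%
    group_by(am) %>%
    filter(abs(disp-mean(disp))/sd(disp) < 1) %>%
    setdiff(mtcars, .) %>%
    count(am) %>%
    select(2) %>%   # keep only the count column
    unlist -> out   # plain vector of outlier counts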

#out 
#n1 n2 
#8 2 

summarize(group_by(mtcars, am), total = n()) %>% 
mutate(outliers = out, percent = outliers/total) 

#  am total outliers percent 
# (dbl) (int) (int)  (dbl) 
#1  0 19  8 0.4210526 
#2  1 13  2 0.1538462

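Following devmacrile's suggestion, I did the following. First, you group the data by the grouping variable. Then you want to build a flag column; here I created it with mutate(). The column holds TRUE and FALSE. You can use count() to find how many data points exist for each combination of am and check. Then you reshape the result with spread() from the tidyr package. Now compute the total number of data points for the am groups 0 and 1. Once more, you group the data by am, and finally you handle the percentage calculation and the column renaming in transmute(). I hope this sample helps you.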

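mtcars %>%
    group_by(am) %>%
    # check is TRUE for rows within 1 SD of the group mean, FALSE for outliers
    mutate(check = abs(disp-mean(disp))/sd(disp) < 1) %>%
    count(am, check) %>%
    # one column per flag value: `FALSE` (outliers) and `TRUE` (kept rows)
    spread(check, n) %>%
    mutate(total = `FALSE` + `TRUE`) %>%
    group_by(am) %>%
    transmute(total, outliers = `FALSE`, percentage = `FALSE`/total)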

#  am total outliers percentage 
# (dbl) (int) (int)  (dbl) 
#1  0 19  8 0.4210526 
#2  1 13  2 0.1538462
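
In current versions of tidyr, spread() has been superseded by pivot_wider(). A minimal sketch of the same flag-and-reshape idea with pivot_wider() follows, assuming tidyr 1.0.0 or later; the object name `res` is only for illustration, and the 1 SD cutoff is carried over from the demonstration above.

```r
library(dplyr)
library(tidyr)

res <- mtcars %>%
  group_by(am) %>%
  # check is TRUE for rows within 1 SD of the group mean, FALSE for outliers
  mutate(check = abs(disp - mean(disp)) / sd(disp) < 1) %>%
  ungroup() %>%
  count(am, check) %>%
  # one column per flag value; a group with no rows for a flag gets 0
  pivot_wider(names_from = check, values_from = n,
              values_fill = list(n = 0L)) %>%
  mutate(total = `FALSE` + `TRUE`,
         outliers = `FALSE`,
         percentage = outliers / total) %>%
  select(am, total, outliers, percentage)

res
```

Filling missing combinations with 0 keeps the addition `FALSE` + `TRUE` from returning NA when a group has no outliers at all.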

2015-09-29 15:47:41 jazzurro

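Rather than calling filter() right away, I would create a flag field (i.e. 1 or 0) indicating whether the result is an outlier, and then roll that up into the appropriate summary.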
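
A minimal sketch of that flag-then-summarise idea, using mtcars as stand-in data because the thread has no sample data. The column name `is_outlier`, the object name `flagged`, and the 1 SD cutoff are assumptions for illustration; with the real data the cutoff would be 4 and the columns would be `drug` and `result`.

```r
library(dplyr)

flagged <- mtcars %>%
  group_by(am) %>%
  # 1 if the row lies 1 SD or more from its group mean, otherwise 0
  mutate(is_outlier = as.integer(abs(disp - mean(disp)) / sd(disp) >= 1)) %>%
  # the sum of a 0/1 flag is the outlier count, its mean is the outlier share
  summarise(total = n(),
            outliers = sum(is_outlier),
            percent = mean(is_outlier))

flagged
```

Because the flag is numeric, no reshaping step is needed: the count and the proportion come straight out of one summarise() call.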

2015-09-29 15:26:47 devmacrile

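I just figured out how to do this with summarise. I've only been using R for about a week, so if there is a better way, please let me know: `isnorm <- function(x) {sum(abs(x-mean(x))/sd(x) [...] (data %>% group_by(drug), N = n(), normal = isnorm(test), outliers = N - normal, out_pct = outliers/N)` –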

@AlexanderDavid Yeah, looks good. It's more idiomatic to chain the whole thing, like `data %>% group_by(drug) %>% summarize(...)` – Frank
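
The chained form suggested here might look like the sketch below. Part of the code in the comment above was lost, so this is a guess at its intent rather than the original: the `< cutoff` comparison inside `isnorm`, the default cutoff of 4 (taken from the question), and the toy data frame are all assumptions.

```r
library(dplyr)

# Hypothetical toy data: drug A has one extreme result, drug B has none
data <- tibble(
  drug = c(rep("A", 31), rep("B", 30)),
  result = c(rep(c(9, 10, 11), times = 10), 500,
             rep(c(48, 50, 52), times = 10))
)

# Assumed helper: number of values within `cutoff` SDs of the mean
isnorm <- function(x, cutoff = 4) {
  sum(abs(x - mean(x)) / sd(x) < cutoff)
}

outlier_summary <- data %>%
  group_by(drug) %>%
  summarise(N = n(),
            normal = isnorm(result),
            outliers = N - normal,
            out_pct = outliers / N)

outlier_summary
```

Later arguments to summarise() can refer to columns created earlier in the same call, which is why `outliers` and `out_pct` can be built from `N` and `normal` directly.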

Awesome, thank you all! –
