2017-07-24 104 views
3

For loop with subset in R

I have the following data in a csv file:

Date  Model  Color Value Samples 
6/19/2017 Gold  Blue  0.5  500 
6/19/2017 Gold  Red  0.0  449 
6/19/2017 Silver  Blue  0.75 1320 
6/19/2017 Silver  Blue  1.5  103 
6/19/2017 Gold  Red  0.7  891 
6/19/2017 Gold  Blue  0.41 18103 
6/19/2017 Copper  Blue  0.83 564 
6/19/2017 Silver  Pink  1.17 173 
6/19/2017 Platinum Brown 0.43 793 
6/19/2017 Platinum Red  0.71 1763 
6/19/2017 Gold  Orange 1.92 503 

I use the fread function to create a data.table:

library(dplyr) 
library(data.table) 

df <- fread("test_data.csv", 
       header = TRUE, 
       fill = TRUE, 
       sep = ",") 

I then subset the data by Model, as follows:

df_subset <- subset(df, df$Model=='Gold' & df$Value > 0) 

I then compute several percentiles per Color, as follows:

df_subset[, .(Samples = sum(Samples), 
    '50th' = quantile(AvgValue, probs = c(0.50)), 
    '99th' = quantile(AvgValue, probs = c(0.99)), 
    '99.9th' = quantile(AvgValue, probs = c(0.999)), 
    '99.99th' = quantile(AvgValue, probs = c(0.9999))), 
by = Color] 

This gives the following output:

Color Samples 50th 99th 99.9th 99.99th 
1: Blue 18603 0.455 0.4991 0.49991 0.499991 
2: Red 1340 0.975 1.2445 1.24945 1.249945 
3: Orange  503 1.920 1.9200 1.92000 1.920000 

I am trying to iterate over the Model values and output the list of percentile values for each Model.

I have tried the following (which fails):

models <- unique(df$Model) 

for (model in models){ 

    df$model[, .(Samples = sum(Samples), 
       '50th' = quantile(Value, probs = c(0.50)), 
       '99th' = quantile(Value, probs = c(0.99)), 
       '99.9th' = quantile(Value, probs = c(0.999)), 
       '99.99th' = quantile(Value, probs = c(0.9999))), 
      by = Color] 
} 

The error message is:

Error in .(Samples = sum(Samples), `50th` = quantile(Value, probs = c(0.5)), : could not find function "." 
+0

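With the 'dplyr' package, 'group_by' and 'mutate' do this without a for loop. Alternatively, you can do it in one line of code by passing a list of two variables to the by argument, which groups over both Model and Color. – Masoud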

+0

What is "AvgValue"? – dww

Answers

2

fread creates a data.table object rather than a plain data frame, so I would suggest sticking with data.table syntax and not mixing it with dplyr.

qs = df[Value > 0, .(Samples = sum(Samples), 
       '50th' = quantile(Value, probs = c(0.50)), 
       '99th' = quantile(Value, probs = c(0.99)), 
       '99.9th' = quantile(Value, probs = c(0.999)), 
       '99.99th' = quantile(Value, probs = c(0.9999))), 
      by = .(Model, Color)] 
setkey(qs, 'Model') 

#  Model Color Samples 50th 99th 99.9th 99.99th 
# 1: Copper Blue  564 0.830 0.8300 0.83000 0.830000 
# 2:  Gold Blue 18603 0.455 0.4991 0.49991 0.499991 
# 3:  Gold Red  891 0.700 0.7000 0.70000 0.700000 
# 4:  Gold Orange  503 1.920 1.9200 1.92000 1.920000 
# 5: Platinum Brown  793 0.430 0.4300 0.43000 0.430000 
# 6: Platinum Red 1763 0.710 0.7100 0.71000 0.710000 
# 7: Silver Blue 1423 1.125 1.4925 1.49925 1.499925 
# 8: Silver Pink  173 1.170 1.1700 1.17000 1.170000 
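
Since the question asks for the percentiles of each Model separately, here is a minimal sketch of how the keyed result could be queried per model or split into one table per model. It rebuilds the question's sample data inline and computes only the 50th percentile to stay short; the names gold and by_model are made up for illustration.

```r
library(data.table)

# Sample data from the question, rebuilt inline so the sketch is self-contained.
df <- data.table(
  Model   = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold",
              "Copper", "Silver", "Platinum", "Platinum", "Gold"),
  Color   = c("Blue", "Red", "Blue", "Blue", "Red", "Blue",
              "Blue", "Pink", "Brown", "Red", "Orange"),
  Value   = c(0.5, 0, 0.75, 1.5, 0.7, 0.41, 0.83, 1.17, 0.43, 0.71, 1.92),
  Samples = c(500L, 449L, 1320L, 103L, 891L, 18103L, 564L, 173L, 793L, 1763L, 503L)
)

# Same grouping as above, reduced to one percentile for brevity.
qs <- df[Value > 0,
         .(Samples = sum(Samples),
           `50th`  = quantile(Value, probs = 0.50)),
         by = .(Model, Color)]
setkey(qs, Model)

# Keyed lookup: the rows for a single model.
gold <- qs["Gold"]

# One table per model, returned as a named list.
by_model <- split(qs, by = "Model")
```

by_model[["Silver"]] would then hold the rows for Silver only, which may be all the loop in the question was meant to produce.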
2

This might solve your problem:

library(dplyr) 

df [,-1] %>% filter(Value > 0) %>% group_by(Model, Color) %>% 
     do(data.frame(t(quantile(.$Value, probs = c(0.50, 0.99, 0.999, 0.9999))))) 

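Regarding your question in the comments about how to attach the sample sums: you can use aggregate. I don't use dplyr::summarise because I would have to start a new pipeline after applying do, which would be pointless.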

data.frame(df %>% filter(Value > 0) %>% select(-Date) %>% group_by(Model, Color) %>% 
       do(data.frame(t(quantile(.$Value, probs = c(0.50, 0.99, 0.999, 0.9999))))), 
      aggregate(Samples ~ Color+Model, df, sum)["Samples"]) 

#  Model Color X50. X99. X99.9. X99.99. Samples 
# 1 Copper Blue 0.830 0.8300 0.83000 0.830000  564 
# 2  Gold Blue 0.455 0.4991 0.49991 0.499991 18603 
# 3  Gold Orange 1.920 1.9200 1.92000 1.920000  503 
# 4  Gold Red 0.700 0.7000 0.70000 0.700000 1340 
# 5 Platinum Brown 0.430 0.4300 0.43000 0.430000  793 
# 6 Platinum Red 0.710 0.7100 0.71000 0.710000 1763 
# 7 Silver Blue 1.125 1.4925 1.49925 1.499925 1423 
# 8 Silver Pink 1.170 1.1700 1.17000 1.170000  173 
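
Note that the aggregate call above sums Samples over the unfiltered df, which is why Gold/Red shows 1340 here rather than 891. As a possible alternative (not the approach used in this answer), the sums and the percentiles can be computed in a single summarise call on the filtered rows. This is only a sketch: the column names p50, p99, p99.9 and p99.99 and the object name res are made up for illustration.

```r
library(dplyr)

# Sample data from the question, rebuilt inline so the sketch is self-contained.
df <- data.frame(
  Model   = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold",
              "Copper", "Silver", "Platinum", "Platinum", "Gold"),
  Color   = c("Blue", "Red", "Blue", "Blue", "Red", "Blue",
              "Blue", "Pink", "Brown", "Red", "Orange"),
  Value   = c(0.5, 0, 0.75, 1.5, 0.7, 0.41, 0.83, 1.17, 0.43, 0.71, 1.92),
  Samples = c(500L, 449L, 1320L, 103L, 891L, 18103L, 564L, 173L, 793L, 1763L, 503L),
  stringsAsFactors = FALSE
)

res <- df %>%
  filter(Value > 0) %>%              # drop zero values before summing and ranking
  group_by(Model, Color) %>%
  summarise(
    Samples = sum(Samples),
    p50     = quantile(Value, 0.50,   names = FALSE),
    p99     = quantile(Value, 0.99,   names = FALSE),
    p99.9   = quantile(Value, 0.999,  names = FALSE),
    p99.99  = quantile(Value, 0.9999, names = FALSE)
  ) %>%
  ungroup()
```

Because the filter runs before the sum, Samples here counts only the rows with Value > 0, matching the data.table answer.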

Data:

df <- structure(list(Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), .Label = "6/19/2017", class = "factor"), Model = structure(c(2L, 
2L, 4L, 4L, 2L, 2L, 1L, 4L, 3L, 3L, 2L), .Label = c("Copper", 
"Gold", "Platinum", "Silver"), class = "factor"), Color = structure( 
c(1L,5L, 1L, 1L, 5L, 1L, 1L, 4L, 2L, 5L, 3L), .Label = c("Blue", "Brown", 
"Orange", "Pink", "Red"), class = "factor"), Value = c(0.5, 0, 
0.75, 1.5, 0.7, 0.41, 0.83, 1.17, 0.43, 0.71, 1.92), Samples = c(500L, 
449L, 1320L, 103L, 891L, 18103L, 564L, 173L, 793L, 1763L, 503L)), 
.Names = c("Date", "Model", "Color", "Value", "Samples"), 
class = "data.frame", row.names = c(NA, -11L)) 
+0

How would this code be modified to output Samples as well? Thanks. – equanimity

+0

@equanimity If you are still interested, take a look at the update. – Masoud

1

Using your definitions, you could try the following:

library(data.table) 
df<-fread("~/theData.csv") 
df$Value<-as.numeric(df$Value) 
result<-data.frame() 
for (i in seq_along(unique(df$Model))){ 
    temp <- subset(df, df$Model==unique(df$Model)[i] & df$Value > 0) 
    temp<-temp[, .(Samples = sum(Samples), 
    '50th' = quantile(Value, probs = c(0.50)), 
    '99th' = quantile(Value, probs = c(0.99)), 
    '99.9th' = quantile(Value, probs = c(0.999)), 
    '99.99th' = quantile(Value, probs = c(0.9999))), 
    by = Color] 
    temp$model<-unique(df$Model)[i] 
    result<-rbind(result, temp) 
} 
rm(temp)
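
As for why the loop in the question fails: df$model most likely looks up a column literally named "model", which does not exist (the column is "Model"), so it returns NULL. The following [ ... ] is then ordinary subsetting of NULL rather than a data.table call, and .() is evaluated as a regular function, hence the 'could not find function "."' error. A minimal sketch of a loop that filters on the Model column instead, collecting the results in a named list, could look like this. The names m, results and combined are made up for illustration, and the sample data is rebuilt inline.

```r
library(data.table)

# Sample data from the question, rebuilt inline so the sketch is self-contained.
df <- data.table(
  Model   = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold",
              "Copper", "Silver", "Platinum", "Platinum", "Gold"),
  Color   = c("Blue", "Red", "Blue", "Blue", "Red", "Blue",
              "Blue", "Pink", "Brown", "Red", "Orange"),
  Value   = c(0.5, 0, 0.75, 1.5, 0.7, 0.41, 0.83, 1.17, 0.43, 0.71, 1.92),
  Samples = c(500L, 449L, 1320L, 103L, 891L, 18103L, 564L, 173L, 793L, 1763L, 503L)
)

models  <- unique(df$Model)
results <- vector("list", length(models))
names(results) <- models

for (m in models) {
  # Filter rows in i, then summarise per Color in j.
  results[[m]] <- df[Model == m & Value > 0,
                     .(Samples   = sum(Samples),
                       `50th`    = quantile(Value, probs = 0.50),
                       `99th`    = quantile(Value, probs = 0.99),
                       `99.9th`  = quantile(Value, probs = 0.999),
                       `99.99th` = quantile(Value, probs = 0.9999)),
                     by = Color]
}

# Optionally stack the per-model tables; the list names become the Model column.
combined <- rbindlist(results, idcol = "Model")
```

Pre-allocating the list and binding once at the end avoids growing result with rbind on every iteration.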