2014-08-27 36 views
5

我想使用R dplyr包,以計算下面的時間間隔相關的問題,而無需使用循環:dplyr和間隔:計數觀察和總和的數據,而不使用循環

  1. 我想計數在每個觀測間隔(絕對和相對間隔端點)
  2. 我想在每個間隔中總結的觀測數據(絕對和相對間隔端點)

的間隔端點是從柱df_abs $間隔和df_rel $間隔。例如

  1. 間隔:(-INF,-60]
  2. 間隔:(-60,-30]
  3. 間隔:(-30,0]

的數據與所述數據幀和時間間隔是這樣的:

library(dplyr) 

# ----------{ data and interval ---------- 
df_data <- data.frame(varA = NA, 
         varB = NA, 
         varC = c(-81.0, -14.3, 29.6, 42.7, 46.4, 57.7, 15.3, 256.3, 20.3, -25.1, -23.1, -17.5)) 

df_abs <- data.frame(interval = c(-Inf, -60, -30, 0, 30, 60, 100, 200, Inf), 
        count = NA, 
        sum = NA) 

df_rel <- data.frame(interval = c(0,5,15,50,75,95,100), 
        count = NA, 
        sum = NA) 
# ---------- data and interval }---------- 


# ----------{ calculation ----------  
# absolute data frame 
for (i in 1 : nrow(df_abs)-1) { 
    # count observation between interval 
    df_abs$count[i+1] <- summarise(df_data, sum(df_abs$interval[i] < varC & varC <= df_abs$interval[i+1])) 

    # sum between interval 
    df_abs$sum[i+1] <- sum(df_data$varC[df_abs$interval[i] < df_data$varC & df_data$varC <= df_abs$interval[i+1]]) 
} 


# relative data frame 
df_data_arranged <- df_data %>% 
         arrange(varC) %>% 
         mutate(observationPercent = c(1:nrow(df_data)) * 100/length(df_data$varC)) 


for (i in 1 : nrow(df_rel)-1) { 
    # count observation between interval 
    df_rel$count[i+1] <- summarise(df_data_arranged, sum(df_rel$interval[i] < observationPercent & observationPercent <= df_rel$interval[i+1])) 

    # sum between interval 
    df_rel$sum[i+1] <- sum(df_data_arranged$varC[df_rel$interval[i] < df_data_arranged$observationPercent & df_data_arranged$observationPercent <= df_rel$interval[i+1]]) 
}  
# ---------- calculation }---------- 

答案應該是這樣的:

df_abs <- data.frame(interval = c(-Inf, -60, -30, 0, 30, 60, 100, 200, Inf), 
        count = c(0,1,0,4,3,3,0,0,1), 
        sum = c(0,-81,0,-80,65.2,146.8,0,0,256.3))  

df_rel <- data.frame(interval = c(0,5,15,50,75,95,100), 
        count = c(0,0,1,4,3,2,1), 
        sum = c(0,0,-81,-39.6,92.6,104.1,256.3)) 

就我所瞭解的dplyr軟件包而言,對於這兩個問題中的每一個都應該有一個相當簡短和直接的解決方案,而不必使用循環。

回答

2

這可以如下進行:

  • 創建新列(mutate),以識別哪個觀察屬於哪個區間(通過base::cut

  • 組觀察結果由區間(group_by

  • 將結果應用於您的操作(summarisedplyrn()和一個co MMON sum這裏)

如下:

df_abs <- mutate(df_data, interval = cut(varC, df_abs$interval)) %>% 
    group_by(interval) %>% 
    summarise(count=n(), sum=sum(varC)) 
#  interval count sum 
#1 (-Inf,-60]  1 -81.0 
#2 (-30,0]  4 -80.0 
#3  (0,30]  3 65.2 
#4 (30,60]  3 146.8 
#5 (200, Inf]  1 256.3 

df_rel <- mutate(df_data_arranged, 
       interval = cut(observationPercent, df_rel$interval)) %>% 
    group_by(interval) %>% 
    summarise(count=n(), sum=sum(varC)) 
# interval count sum 
#1 (5,15]  1 -81.0 
#2 (15,50]  5 -64.7 
#3 (50,75]  3 92.6 
#4 (75,95]  2 104.1 
#5 (95,100]  1 256.3