2015-02-10 56 views

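cumsum and an if condition in dplyr in R do not give the expected output

I am trying to solve the problem below with dplyr and have made some progress, but I have run into a few issues along the way.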

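Problem statement

Within each group (grouped by ID), if the current HID and the previous HID of the same ID are different and Interval < 30, the Penalty column should show the value from Amount. In all other cases it should show 0 (that is, either the HIDs are the same, or the HIDs differ but Interval >= 30).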

Data

"ID","DaysToEvent","HID","Interval","Amount" 
2197560,16369,"011",29,90105 
2197560,16494,"121",29,50526 
2197560,16509,"121",29,194568 
2197560,16569,"001",31,27236 
2197560,16577,"128",29,17309 
2197578,14447,"001",29,17276 
2197578,14468,"021",29,12661 
2197578,14489,"001",31,15015 
2197578,14517,"001",29,19000 
2197578,14517,"02P",29,19001 
2197578,14517,"001",31,19002 
2197578,14517,"001",29,19003 
2197578,14517,"001",29,19004 

My code

mycoredata2009 = read.csv('path/to/abovefile.csv') 
CumulativeCumulativeCost = 0; 
mycoredata2009 = mycoredata2009 %>% 
    group_by(ID) %>% 
    mutate(Penalty = ifelse(((HID != lag(HID)) & Interval < 30) ,Amount,0)) %>% 
    mutate(CumulativeCost=cumsum(as.numeric(Penalty))) %>% 
    CumulativeCumulativeCost = cumsum(as.numeric(CumulativeCost)) %>% 
    cat(paste("For group with ID==",ID,"CumulativeCost==", CumulativeCost,sep="")) 
    mycoredata2009 = as.data.frame(mycoredata2009) 

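Problems I am currently facing

There are several problems with this code:

  1. The Penalty column shows Amount even when the current HID and the previous HID have the same value. (The other two conditions work correctly.)

  2. The CumulativeCost column, which should be a running total of the Penalty column, always shows NA.

  3. At the end of each group, I would like to print that group's CumulativeCost, and keep the group's ID and CumulativeCost in the final output data frame.

  4. I would also like a variable called CumulativeCumulativeCost which, as the name suggests, is the running sum of each group's CumulativeCost.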

Output received

ID DaysToEvent HID Interval Amount Penalty CumulativeCost 
1 2197560  16369  011    29 90105  NA    NA 
2 2197560  16494  121    29 50526 50526    NA 
3 2197560  16509  121    29 194568 194568    NA 
4 2197560  16569  001    31 27236  0    NA 
5 2197560  16577  128    29 17309 17309    NA 
6 2197578  14447  001    29 17276  NA    NA 
7 2197578  14468  021    29 12661 12661    NA 
8 2197578  14489  001    31 15015  0    NA 
9 2197578  14517  001    29 19000 19000    NA 
10 2197578  14517  02P    29 19001 19001    NA 
11 2197578  14517  001    31 19002  0    NA 
12 2197578  14517  001    29 19003 19003    NA 
13 2197578  14517  001    29 19004 19004    NA 

Expected output (calculated by hand)

ID DaysToEvent HID Interval Amount Penalty CumulativeCost 
1 2197560  16369  011    29 90105  NA    NA 
2 2197560  16494  121    29 50526 50526   50526 
3 2197560  16509  121    29 194568  0   50526 
4 2197560  16569  001    31 27236  0   50526 
5 2197560  16577  128    29 17309 17309   67835 
6 2197578  14447  001    29 17276  NA    NA 
7 2197578  14468  021    29 12661 12661   12661 
8 2197578  14489  001    31 15015  0   12661 
9 2197578  14517  001    29 19000  0   12661 
10 2197578  14517  02P    29 19001 19001   31662 
11 2197578  14517  001    31 19002  0   31662 
12 2197578  14517  001    29 19003  0   31662 
13 2197578  14517  001    29 19004  0   31662 

Answer


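Based on the expected output: after creating the "Penalty" column with the logical condition (HID != lag(HID, ...)), we change the first observation of "Penalty" in each group to NA, take the cumsum of the remaining rows, and prepend NA to it (c(NA, cumsum(...))) to build "CumulativeCost".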

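library(dplyr) 
mycoredata2009 %>% 
    group_by(ID) %>% 
    # default = 0 gives the first row of each group a previous value to compare against;
    # newer dplyr versions may require default to have the same type as HID
    mutate(Penalty = ifelse(HID != lag(HID, default = 0) & Interval < 30, Amount, 0), 
       # the first row of a group has no previous HID, so set its Penalty to NA
       Penalty = ifelse(row_number() == 1L, NA, Penalty), 
       # running total over the remaining rows, with NA prepended for the first row
       CumulativeCost = c(NA, cumsum(Penalty[-1L]))) 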
    #  ID DaysToEvent HID Interval Amount Penalty CumulativeCost 
    #1 2197560  16369 011  29 90105  NA    NA 
    #2 2197560  16494 121  29 50526 50526   50526 
    #3 2197560  16509 121  29 194568  0   50526 
    #4 2197560  16569 001  31 27236  0   50526 
    #5 2197560  16577 128  29 17309 17309   67835 
    #6 2197578  14447 001  29 17276  NA    NA 
    #7 2197578  14468 021  29 12661 12661   12661 
    #8 2197578  14489 001  31 15015  0   12661 
    #9 2197578  14517 001  29 19000  0   12661 
    #10 2197578  14517 02P  29 19001 19001   31662 
    #11 2197578  14517 001  31 19002  0   31662 
    #12 2197578  14517 001  29 19003  0   31662 
    #13 2197578  14517 001  29 19004  0   31662 
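
The reason for dropping the first element before calling cumsum can be seen on a plain vector. This is only a small base R sketch; the vector `penalty` is typed in by hand from the first group of the expected output and is not part of the original data frame.

```r
# Penalty values of the first group; the first entry is NA because
# there is no previous HID to compare against.
penalty <- c(NA, 50526, 0, 0, 17309)

# cumsum() propagates NA: once an NA is met, every later value is NA too.
all_na <- cumsum(penalty)
print(all_na)
# NA NA NA NA NA

# Skipping the first element and prepending NA keeps the running total.
cumulative_cost <- c(NA, cumsum(penalty[-1L]))
print(cumulative_cost)
# NA 50526 50526 50526 67835
```

This would explain why a leading NA in Penalty turns the whole CumulativeCost column into NA when cumsum is applied to the full column.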

Alternatively, we can drop the ifelse calls

mycoredata2009 %>% 
    group_by(ID) %>% 
    mutate(Penalty=NA^(row_number()==1L)*(HID!=lag(HID, default=0) & 
        Interval<30)*Amount, 
      CumulativeCost=c(NA, cumsum(Penalty[-1L]))) 
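
The NA^(...) expression relies on NA^0 being 1 and NA^1 being NA in R, so it acts as a mask that blanks out the first row only. A minimal sketch on made-up vectors (`first_row`, `condition` and `amount` are illustrative names, not columns from the question):

```r
# TRUE marks the first row of a group, as row_number() == 1L would.
first_row <- c(TRUE, FALSE, FALSE)

# TRUE/FALSE are coerced to 1/0, so this yields NA for the first row and 1 elsewhere.
mask <- NA^first_row
print(mask)
# NA 1 1

# A logical condition multiplied by a number keeps the number where TRUE and gives 0 where FALSE.
condition <- c(TRUE, TRUE, FALSE)
amount <- c(100, 200, 300)
penalty <- mask * condition * amount
print(penalty)
# NA 200 0
```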

Or use data.table

library(data.table) #data.table_1.9.5 
setDT(mycoredata2009)[, { 
    tmp = NA^(1:.N==1L)*(HID!= shift(HID, fill=0) & Interval<30)*Amount 
    c(.SD, list(Penalty=tmp, CumulativeCost=c(NA, cumsum(tmp[-1L])))) 
    },ID] 

    #1: 2197560  16369 011  29 90105  NA    NA 
    #2: 2197560  16494 121  29 50526 50526   50526 
    #3: 2197560  16509 121  29 194568  0   50526 
    #4: 2197560  16569 001  31 27236  0   50526 
    #5: 2197560  16577 128  29 17309 17309   67835 
    #6: 2197578  14447 001  29 17276  NA    NA 
    #7: 2197578  14468 021  29 12661 12661   12661 
    #8: 2197578  14489 001  31 15015  0   12661 
    #9: 2197578  14517 001  29 19000  0   12661 
    #10: 2197578  14517 02P  29 19001 19001   31662
    #11: 2197578  14517 001  31 19002  0   31662
    #12: 2197578  14517 001  29 19003  0   31662
    #13: 2197578  14517 001  29 19004  0   31662
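
Points 3 and 4 of the question (a per-group total and a running sum of those totals) are not covered by the code above. One possible way to get them, as a sketch only: the data frame `penalties` below is typed in by hand from the expected output, and `group_totals` is a name chosen here for illustration.

```r
library(dplyr)

# ID and Penalty copied from the expected output in the question.
penalties <- data.frame(
  ID = c(rep(2197560, 5), rep(2197578, 8)),
  Penalty = c(NA, 50526, 0, 0, 17309,
              NA, 12661, 0, 0, 19001, 0, 0, 0)
)

group_totals <- penalties %>%
  group_by(ID) %>%
  # total penalty per group, ignoring the NA in each group's first row
  summarise(CumulativeCost = sum(Penalty, na.rm = TRUE)) %>%
  # running sum of the group totals across groups
  mutate(CumulativeCumulativeCost = cumsum(CumulativeCost))

# Print one line per group.
cat(sprintf("For group with ID==%s CumulativeCost==%s\n",
            group_totals$ID, group_totals$CumulativeCost), sep = "")
```

With this data the group totals are 67835 and 31662, and the running sum across groups is 67835 and 99497. Keeping the totals in a separate summary table avoids mixing per-row and per-group values in one data frame; it could be joined back to the row-level data by ID if a single table is needed.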