2013-05-17 79 views
1

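R: aggregating over a time series when some groups are missing dates

I run into a problem when aggregating data over a series of dates where some dates are missing from some, but not all, of the groups.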

dates <- seq.Date(as.Date("2010-01-01"), by=7, length.out=5) 
dates.2 <- dates[-2] 
all.dates <- c(dates, dates, dates.2, dates.2) 
subgroups <- c(rep("a", 5), rep("b", 5), rep("c", 4), rep("d", 4)) 
groups <- c(rep("X", 10), rep("Y", 8)) 
set.seed(2) 

df.1 <- data.frame(Date = all.dates,
                   Group = groups,
                   Subgrp = subgroups,
                   Cost = runif(18, 100, 200))
df.1 

     Date Group Subgrp  Cost 
1 2010-01-01  X  a 118.4882 
2 2010-01-08  X  a 170.2374 
3 2010-01-15  X  a 157.3326 
4 2010-01-22  X  a 116.8052 
5 2010-01-29  X  a 194.3839 
6 2010-01-01  X  b 194.3475 
7 2010-01-08  X  b 112.9159 
8 2010-01-15  X  b 183.3449 
9 2010-01-22  X  b 146.8019 
10 2010-01-29  X  b 154.9984 
11 2010-01-01  Y  c 155.2674 
12 2010-01-15  Y  c 123.8895 
13 2010-01-22  Y  c 176.0513 
14 2010-01-29  Y  c 118.0820 
15 2010-01-01  Y  d 140.5282 
16 2010-01-15  Y  d 185.3548 
17 2010-01-22  Y  d 197.6398 
18 2010-01-29  Y  d 122.5825 

> ag.1 <- aggregate(Cost ~ Group + Date, FUN=sum, data=df.1) 
> ag.1 
    Group  Date  Cost 
1  X 2010-01-01 312.8357 
2  Y 2010-01-01 295.7956 
3  X 2010-01-08 283.1533 
4  X 2010-01-15 340.6775 
5  Y 2010-01-15 309.2443 
6  X 2010-01-22 263.6070 
7  Y 2010-01-22 373.6912 
8  X 2010-01-29 349.3823 
9  Y 2010-01-29 240.6646 

Group Y has no payments on 2010-01-08, but the ag.1 object is silent about group Y on that date. I would like ag.1 to have a row that reflects this:

> ag.1 
     Group  Date  Cost 
    1  X 2010-01-01 312.8357 
    2  Y 2010-01-01 295.7956 
    3  X 2010-01-08 283.1533 
    3a Y 2010-01-08 0.0000 
    4  X 2010-01-15 340.6775 
    5  Y 2010-01-15 309.2443 

I tried na.omit=na.pass in the aggregate function, but (1) I don't really know what it does, and (2) it didn't change the output.

Suggestions need not use aggregate, but I would prefer to stay with the base packages.

Answers

2

expand.grid can be used to fill in the missing entries.

df.2 <- expand.grid(Date = unique(dates),Group = unique(groups)) 
df <- merge(df.1,df.2,all=TRUE) 

aggregate(Cost ~ Group + Date, FUN=sum, data=df, na.action=na.pass) 

Edit: prompted by the OP's hint, I found the right adjustment to the aggregate call (na.action = na.pass), which gives:

Group  Date  Cost 
1  X 2010-01-01 312.8357 
2  Y 2010-01-01 295.7956 
3  X 2010-01-08 283.1533 
4  Y 2010-01-08  NA 
5  X 2010-01-15 340.6775 
6  Y 2010-01-15 309.2443 
7  X 2010-01-22 263.6070 
8  Y 2010-01-22 373.6912 
9  X 2010-01-29 349.3823 
10  Y 2010-01-29 240.6646 
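
The output above reports NA for group Y on 2010-01-08, whereas the question asks for 0. A minimal sketch of one way to get zeros, assuming it is acceptable to treat a missing cost as zero: keep na.action = na.pass so the padded row survives, and let the summing function drop the NA. The names grid and ag.0 are illustrative and not part of the original answer.

```r
# Rebuild the example data from the question
dates   <- seq.Date(as.Date("2010-01-01"), by = 7, length.out = 5)
dates.2 <- dates[-2]
set.seed(2)
df.1 <- data.frame(Date   = c(dates, dates, dates.2, dates.2),
                   Group  = c(rep("X", 10), rep("Y", 8)),
                   Subgrp = c(rep("a", 5), rep("b", 5), rep("c", 4), rep("d", 4)),
                   Cost   = runif(18, 100, 200))

# Every Date x Group combination (hypothetical helper name: grid)
grid <- expand.grid(Date = unique(df.1$Date),
                    Group = unique(df.1$Group),
                    stringsAsFactors = FALSE)

# Outer join: combinations absent from df.1 get Cost = NA
df <- merge(df.1, grid, all = TRUE)

# na.pass keeps the NA row; sum(..., na.rm = TRUE) turns it into 0
ag.0 <- aggregate(Cost ~ Group + Date, data = df,
                  FUN = function(x) sum(x, na.rm = TRUE),
                  na.action = na.pass)
ag.0
```

Whether a zero or an NA is the right value depends on whether "no rows" really means "no cost" in your data.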
+1

I bet there is a nice alternative to 'na.omit' that can be used with the formula syntax of aggregate. – Hugh

1

1) As long as every date appears in at least one group, this will do it:

> as.data.frame(xtabs(Cost ~ Date + Group, df.1), responseName = "Cost") 
     Date Group  Cost 
1 2010-01-01  X 312.8357 
2 2010-01-08  X 283.1533 
3 2010-01-15  X 340.6775 
4 2010-01-22  X 263.6070 
5 2010-01-29  X 349.3823 
6 2010-01-01  Y 295.7956 
7 2010-01-08  Y 0.0000 
8 2010-01-15  Y 309.2443 
9 2010-01-22  Y 373.6912 
10 2010-01-29  Y 240.6646 

In fact, the xtabs part on its own may be all you need, if this layout is acceptable:

> xtabs(Cost ~ Date + Group, df.1) 
      Group 
Date    X  Y 
    2010-01-01 312.8357 295.7956 
    2010-01-08 283.1533 0.0000 
    2010-01-15 340.6775 309.2443 
    2010-01-22 263.6070 373.6912 
    2010-01-29 349.3823 240.6646 

2) If there are dates with no entries in any group, first convert Date to a factor whose levels include the dates that do not appear:

> # define levels to be all weeks between minimum date and 2010-02-05 
> levs <- as.character(seq(min(df.1$Date), as.Date("2010-02-05"), by = 7)) 
> df.2 <- transform(df.1, Date = factor(Date, sort(unique(levs)))) 
> 
> # now repeat using df.2 
> as.data.frame(xtabs(Cost ~ Date + Group, df.2), responseName = "Cost") 
     Date Group  Cost 
1 2010-01-01  X 312.8357 
2 2010-01-08  X 283.1533 
3 2010-01-15  X 340.6775 
4 2010-01-22  X 263.6070 
5 2010-01-29  X 349.3823 
6 2010-02-05  X 0.0000 
7 2010-01-01  Y 295.7956 
8 2010-01-08  Y 0.0000 
9 2010-01-15  Y 309.2443 
10 2010-01-22  Y 373.6912 
11 2010-01-29  Y 240.6646 
12 2010-02-05  Y 0.0000 

> xtabs(Cost ~ Date + Group, df.2) 
      Group 
Date    X  Y 
    2010-01-01 312.8357 295.7956 
    2010-01-08 283.1533 0.0000 
    2010-01-15 340.6775 309.2443 
    2010-01-22 263.6070 373.6912 
    2010-01-29 349.3823 240.6646 
    2010-02-05 0.0000 0.0000 
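
One thing to watch with the long format: as.data.frame on a table returns Date as a factor, not as a Date. A short sketch of converting it back and sorting the rows into the Date-then-Group order shown in the question, assuming the question's df.1 (rebuilt here so the snippet runs on its own); the name res is illustrative.

```r
# Rebuild the example data from the question
dates   <- seq.Date(as.Date("2010-01-01"), by = 7, length.out = 5)
dates.2 <- dates[-2]
set.seed(2)
df.1 <- data.frame(Date   = c(dates, dates, dates.2, dates.2),
                   Group  = c(rep("X", 10), rep("Y", 8)),
                   Subgrp = c(rep("a", 5), rep("b", 5), rep("c", 4), rep("d", 4)),
                   Cost   = runif(18, 100, 200))

# Long-format sums; missing Date x Group cells come out as 0
res <- as.data.frame(xtabs(Cost ~ Date + Group, df.1), responseName = "Cost")

# Date is a factor at this point; go through character to get a Date back
res$Date <- as.Date(as.character(res$Date))

# Order by date, then group, and renumber the rows
res <- res[order(res$Date, res$Group), ]
rownames(res) <- NULL
res
```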
+0

+1. I think the second layout is more readable. – Frank