2013-03-04

Getting the sum of values over consecutive rainy days

I have a large data set that looks like this:

Date  rain code 
2009-04-01 0.0 0 
2009-04-02 0.0 0 
2009-04-03 0.0 0 
2009-04-04 0.7 1 
2009-04-05 54.2 1 
2009-04-06 0.0 0 
2009-04-07 0.0 0 
2009-04-08 0.0 0 
2009-04-09 0.0 0 
2009-04-10 0.0 0 
2009-04-11 0.0 0 
2009-04-12 5.3 1 
2009-04-13 10.1 1 
2009-04-14 6.0 1 
2009-04-15 8.7 1 
2009-04-16 0.0 0 
2009-04-17 0.0 0 
2009-04-18 0.0 0 
2009-04-19 0.0 0 
2009-04-20 0.0 0 
2009-04-21 0.0 0 
2009-04-22 0.0 0 
2009-04-23 0.0 0 
2009-04-24 0.0 0 
2009-04-25 4.3 1 
2009-04-26 42.2 1 
2009-04-27 45.6 1 
2009-04-28 12.6 1 
2009-04-29 6.2 1 
2009-04-30 1.0 1 

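I am trying to compute the sum of the rain values over each stretch of consecutive rainy days, that is, where code is 1, and I need a separate sum for each stretch. For example, I want the sum of the rain values from 2009-04-12 to 2009-04-15. So I am looking for a way to identify where code equals 1 on consecutive days and then get the sum of rain for each such stretch.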

Any help with this problem would be greatly appreciated.

Answers


A simple solution is to use rle. I suspect there may be more "elegant" solutions out there, though.

# assuming dd is your data.frame 
dd.rle <- rle(dd$code) 
# get start pos of each consecutive 1's 
start <- (cumsum(dd.rle$lengths) - dd.rle$lengths + 1)[dd.rle$values == 1] 
# how long do each 1's extend? 
ival <- dd.rle$lengths[dd.rle$values == 1] 
# using these two, compute the sum 
apply(as.matrix(seq_along(start)), 1, function(idx) { 
    sum(dd$rain[start[idx]:(start[idx]+ival[idx]-1)]) 
}) 

# [1] 54.9 30.1 111.9 
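
To make the start-position arithmetic above easier to follow, here is a small sketch on a made-up code vector (the vector and the names r, run_end and run_start are illustrative only, not part of the original answer). It shows what rle returns and how cumsum of the run lengths gives the last and first row of each run.

```r
# Toy code vector: two runs of 1s separated by runs of 0s
code <- c(0, 0, 1, 1, 0, 1, 1, 1)

r <- rle(code)
r$lengths   # length of each run: 2 2 1 3
r$values    # value of each run:  0 1 0 1

# Last row of each run, then first row of each run
run_end   <- cumsum(r$lengths)           # 2 4 5 8
run_start <- run_end - r$lengths + 1     # 1 3 5 6

# Keep only the runs of 1s
run_start[r$values == 1]                 # 3 6
```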

Edit: a simpler approach using rle and tapply.

dd.rle <- rle(dd$code) 
# get the length of each consecutive 1's 
ival <- dd.rle$lengths[dd.rle$values == 1] 
# using lengths, construct a `factor` with levels = length(ival) 
levl <- factor(rep(seq_along(ival), ival)) 
# use these levels to extract `rain[code == 1]` and compute sum 
tapply(dd$rain[dd$code == 1], levl, sum) 

# 1  2  3 
# 54.9 30.1 111.9 

+1 for using 'rle'. – 2013-03-04 09:21:58


Does anyone know how to turn the solution above into a function, so that it can be applied to many data sets at once? – user1954153 2013-05-09 10:24:38
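
One possible way to do that, as a rough sketch rather than a definitive answer: wrap the rle and tapply steps in a function and call it on each data frame with lapply. The function name sum_runs, its target argument and the two small data frames are hypothetical; the sketch also assumes the flag column contains no NA values.

```r
# Hypothetical helper: sum `value` over each run where `flag` equals `target`
sum_runs <- function(value, flag, target = 1) {
  r <- rle(flag)
  # One id per run, repeated once for every row in that run
  run_id <- rep(seq_along(r$lengths), r$lengths)
  keep <- flag == target
  as.vector(tapply(value[keep], run_id[keep], sum))
}

# Two made-up data sets with the same column layout as dd
stations <- list(
  a = data.frame(rain = c(0, 0.7, 54.2, 0, 5.3), code = c(0, 1, 1, 0, 1)),
  b = data.frame(rain = c(1.5, 2.5, 0, 0, 3.0), code = c(1, 1, 0, 0, 1))
)

# One vector of run sums per data set
lapply(stations, function(d) sum_runs(d$rain, d$code))
```

With the original data, sum_runs(dd$rain, dd$code) should give the same three sums as the tapply call above.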


Here is a vectorized approach to getting the desired result.

df <- read.table(textConnection("Date  rain code\n2009-04-01 0.0 0\n2009-04-02 0.0 0\n2009-04-03 0.0 0\n2009-04-04 0.7 1\n2009-04-05 54.2 1\n2009-04-06 0.0 0\n2009-04-07 0.0 0\n2009-04-08 0.0 0\n2009-04-09 0.0 0\n2009-04-10 0.0 0\n2009-04-11 0.0 0\n2009-04-12 5.3 1\n2009-04-13 10.1 1\n2009-04-14 6.0 1\n2009-04-15 8.7 1\n2009-04-16 0.0 0\n2009-04-17 0.0 0\n2009-04-18 0.0 0\n2009-04-19 0.0 0\n2009-04-20 0.0 0\n2009-04-21 0.0 0\n2009-04-22 0.0 0\n2009-04-23 0.0 0\n2009-04-24 0.0 0\n2009-04-25 4.3 1\n2009-04-26 42.2 1\n2009-04-27 45.6 1\n2009-04-28 12.6 1\n2009-04-29 6.2 1\n2009-04-30 1.0 1"), 
    header = TRUE) 

df$cumsum <- cumsum(df$rain) 
df$diff <- c(diff(df$code), 0) 
df$result <- rep(NA, nrow(df)) 

if (nrow(df[df$diff == -1, ]) == nrow(df[df$diff == 1, ])) { 
    result <- df[df$diff == -1, "cumsum"] - df[df$diff == 1, "cumsum"] 
    df[df$diff == -1, "result"] <- result 
} else { 
    result <- c(df[df$diff == -1, "cumsum"], df[nrow(df), "cumsum"]) - df[df$diff == 1, "cumsum"] 
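    # drop the last element; `1:length(result) - 1` parses as `(1:length(result)) - 1` and only works by accident
    df[df$diff == -1, "result"] <- result[-length(result)] 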
    df[nrow(df), "result"] <- result[length(result)] 
} 

df 
##          Date rain code cumsum diff result
## 1  2009-04-01  0.0    0    0.0    0     NA
## 2  2009-04-02  0.0    0    0.0    0     NA
## 3  2009-04-03  0.0    0    0.0    1     NA
## 4  2009-04-04  0.7    1    0.7    0     NA
## 5  2009-04-05 54.2    1   54.9   -1   54.9
## 6  2009-04-06  0.0    0   54.9    0     NA
## 7  2009-04-07  0.0    0   54.9    0     NA
## 8  2009-04-08  0.0    0   54.9    0     NA
## 9  2009-04-09  0.0    0   54.9    0     NA
## 10 2009-04-10  0.0    0   54.9    0     NA
## 11 2009-04-11  0.0    0   54.9    1     NA
## 12 2009-04-12  5.3    1   60.2    0     NA
## 13 2009-04-13 10.1    1   70.3    0     NA
## 14 2009-04-14  6.0    1   76.3    0     NA
## 15 2009-04-15  8.7    1   85.0   -1   30.1
## 16 2009-04-16  0.0    0   85.0    0     NA
## 17 2009-04-17  0.0    0   85.0    0     NA
## 18 2009-04-18  0.0    0   85.0    0     NA
## 19 2009-04-19  0.0    0   85.0    0     NA
## 20 2009-04-20  0.0    0   85.0    0     NA
## 21 2009-04-21  0.0    0   85.0    0     NA
## 22 2009-04-22  0.0    0   85.0    0     NA
## 23 2009-04-23  0.0    0   85.0    0     NA
## 24 2009-04-24  0.0    0   85.0    1     NA
## 25 2009-04-25  4.3    1   89.3    0     NA
## 26 2009-04-26 42.2    1  131.5    0     NA
## 27 2009-04-27 45.6    1  177.1    0     NA
## 28 2009-04-28 12.6    1  189.7    0     NA
## 29 2009-04-29  6.2    1  195.9    0     NA
## 30 2009-04-30  1.0    1  196.9    0  111.9
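
The idea behind the cumsum and diff columns may be easier to see on a tiny made-up series. The sketch below is only an illustration, with hypothetical names total, step, before_run and end_of_run, and it assumes the series both starts and ends with code 0, so every run has a row before it and a closing row.

```r
rain <- c(0, 2, 3, 0, 0, 4, 1, 0)
code <- c(0, 1, 1, 0, 0, 1, 1, 0)

total <- cumsum(rain)        # running total: 0 2 5 5 5 9 10 10
step  <- c(diff(code), 0)    # 1 on the row before a run, -1 on its last row

before_run <- total[step == 1]    # running total just before each run: 0 5
end_of_run <- total[step == -1]   # running total on the last day of each run: 5 10

# Rain that fell during each run
end_of_run - before_run           # 5 5
```

Subtracting the running total just before a run from the running total at its end leaves exactly the rain that fell inside the run, which is what the result column holds.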

Can you extend this solution to summarize multiple columns of the data frame? Or even to use different functions (min, max, etc.)? – RegressForward 2016-08-05 03:10:15


To extend this to other functions, the following command is useful: groups_of_runs <- rep(cumsum(r$lengths), r$lengths), since each run then has its own unique group. – RegressForward 2016-08-05 17:31:31
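
As a rough sketch of how that grouping could be used, assuming r is the result of rle on the code column: label each row with its run, keep the rainy rows, and pass the labels to aggregate with whatever summary function is needed. The data frame d, its temp column and the names wet, run_sum and run_max are made up for illustration.

```r
# Made-up data with a second measurement column (`temp` is hypothetical)
d <- data.frame(
  rain = c(0, 0.7, 54.2, 0, 5.3, 10.1),
  temp = c(12, 9, 8, 13, 10, 11),
  code = c(0, 1, 1, 0, 1, 1)
)

r <- rle(d$code)
# Label every row with the end position of its run: 1 3 3 4 6 6
groups_of_runs <- rep(cumsum(r$lengths), r$lengths)

wet <- d$code == 1
# Summarize several columns per rainy run; swap FUN for min, mean, etc.
run_sum <- aggregate(d[wet, c("rain", "temp")],
                     by = list(run = groups_of_runs[wet]), FUN = sum)
run_max <- aggregate(d[wet, c("rain", "temp")],
                     by = list(run = groups_of_runs[wet]), FUN = max)
run_sum
run_max
```

Each output row corresponds to one rainy run, identified in the run column by the row number on which that run ends.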