Count multiple consecutive events by group, with start year

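I've been a lurker for several years, but I've finally run into something I can't work out from old posts. I have a data frame covering hundreds of countries over many years, with an event variable coded as a binary indicator: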

library('dplyr') 
library('data.table') 

country<-c("albania","albania","albania","albania","albania","albania","albania","albania","thailand","thailand","thailand","thailand","thailand","thailand","thailand","thailand") 
year<-c(1960,1961,1962,1963,1964,1965,1966,1967,1972,1973,1974,1975,1976,1977,1978,1979) 
event<-c(0,1,1,0,0,1,1,1,1,1,0,0,1,0,0,0) 
input<-data.frame(country=country, year=year, event=event) 

input  

    country year event
1   albania 1960     0
2   albania 1961     1
3   albania 1962     1
4   albania 1963     0
5   albania 1964     0
6   albania 1965     1
7   albania 1966     1
8   albania 1967     1
9  thailand 1972     1
10 thailand 1973     1
11 thailand 1974     0
12 thailand 1975     0
13 thailand 1976     1
14 thailand 1977     0
15 thailand 1978     0
16 thailand 1979     0

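I want to create a new data frame that lists each run of consecutive events, with its country, its start year, and its duration. For example: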

output 

   country start duration
1  albania  1961        2
2  albania  1965        3
3 thailand  1972        2
4 thailand  1976        1

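I have read what I believe are most of the posts about counting consecutive events by group using rle() and rleid() with dplyr and data.table, but I can't get them to where I want to be.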

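Following this example, I couldn't get a new data frame with the lengths of multiple events per country; it only gives the max, min, etc., and it ignores the start year of each event, which I need to capture. Trying to build on this code to get what I want left me with a lot of errors. The "base code" from the dplyr example seems like a starting point: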

output <- input %>%
  group_by(country) %>%
  do({
    tmp <- with(rle(.$event == 1), lengths[values])
    data.frame(country = .$country, Max = if (length(tmp) == 0) 0 else max(tmp))
  }) %>%
  slice(1L)

This obviously pulls only the maximum, and I struggled trying to change it to pull every event.

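Following the data.table/rleid model, I created a new mutated variable that counts the running duration of consecutive "events", but I can't extract the "end" year of each event when a country has several. Maybe some lagged-difference function on the mutated variable, then extracting all rows with negative values? Once the rows that end an event are flagged, the start year would just be the current year minus the length. The base code for this approach is: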

sum0 <- function(x) { x[x == 1] = sequence(with(rle(x), lengths[values == 1])); x } 
setDT(input)[, duration := sum0(event), by = country] 

input 

     country year event duration
 1:  albania 1960     0        0
 2:  albania 1961     1        1
 3:  albania 1962     1        2
 4:  albania 1963     0        0
 5:  albania 1964     0        0
 6:  albania 1965     1        1
 7:  albania 1966     1        2
 8:  albania 1967     1        3
 9: thailand 1972     1        1
10: thailand 1973     1        2
11: thailand 1974     0        0
12: thailand 1975     0        0
13: thailand 1976     1        1
14: thailand 1977     0        0
15: thailand 1978     0        0
16: thailand 1979     0        0

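There are another 7-10 posts I looked through but didn't link, because they are essentially similar to the two I cited. Thanks in advance to anyone with suggestions. I hope I have followed all the protocols for asking questions; I tried to be careful and stick to the rules. Thank you all for the great work you do! You have gotten me through 5-6 years of learning R and JAGS.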

Source

2017-08-22 Josh Brinks

Here is what I would do (leaving dplyr out of it):

setDT(input) 

input[, 
    if (first(event) == 1) .(year = first(year), N = .N) 
, by=.(country, g = rleid(country, event))][, !"g"] 

    country year N
1:  albania 1961 2
2:  albania 1965 3
3: thailand 1972 2
4: thailand 1976 1

Not very efficient, but hopefully easy to follow.
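To see why grouping on rleid(country, event) works, it can help to look at the run id on its own. The snippet below is only an illustrative sketch: it rebuilds the question's sample data and adds a helper column g (a name chosen here for illustration, matching the temporary grouping name used above). rleid() starts a new id whenever any of its arguments differs from the previous row, so each stretch of identical event values within a country gets its own id.

```r
library(data.table)

# Sample data from the question
input <- data.table(
  country = rep(c("albania", "thailand"), each = 8),
  year    = c(1960:1967, 1972:1979),
  event   = c(0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0)
)

# g increases by one each time country or event changes from the previous row
input[, g := rleid(country, event)]

input$g
# [1] 1 2 2 3 3 4 4 4 5 5 6 6 7 8 8 8
```

Rows 6-8 (albania) and rows 9-10 (thailand) all have event == 1, but they get different ids (4 and 5) because country changes between them. Keeping only the groups whose first event is 1 then leaves one group per run of events.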

Source

2017-08-22 14:34:35 Frank

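Better than my solution. – mt1022

@mt1022 I disagree. I think yours is more efficient and more idiomatic. – Frank

Looks good, but I have to step out for a while. I'll check it when I get back. Thanks for the help; I've only just started learning data.table, so it's still somewhat foreign to me. –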

This is what you want:

library(data.table) 

setDT(input) 
input[, .(event = event[1], start = year[1], duration = .N), 
     by = .(country, rleidv(event))][event == 1][ 
      , c('event', 'rleidv') := NULL][] 

#     country start duration
# 1:  albania  1961        2
# 2:  albania  1965        3
# 3: thailand  1972        2
# 4: thailand  1976        1

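As Frank pointed out in the comments, the computation in this solution is optimized internally by data.table, which makes it more efficient. A j expression of the form if(cond) ... will not be optimized.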
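Since the question started from dplyr, the same summarise-then-filter shape can also be sketched there. This is not part of either answer, only a rough equivalent under stated assumptions: rows are already sorted by year within each country, the years within a run have no gaps, dplyr is version 1.0 or later (for the .groups argument), and data.table is installed so that data.table::rleid() is available. The column name run is a hypothetical helper introduced here.

```r
library(dplyr)

# Sample data from the question
input <- data.frame(
  country = rep(c("albania", "thailand"), each = 8),
  year    = c(1960:1967, 1972:1979),
  event   = c(0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0)
)

output <- input %>%
  # One group per (country, run of identical event values)
  group_by(country, run = data.table::rleid(event)) %>%
  summarise(
    event    = dplyr::first(event),  # value shared by the whole run
    start    = dplyr::first(year),   # first year of the run
    duration = n(),                  # number of rows (years) in the run
    .groups  = "drop"
  ) %>%
  filter(event == 1) %>%             # keep only runs of events
  select(country, start, duration)

output
```

As in the data.table version, the run id is computed over the whole column, and pairing it with country in the grouping is what keeps a run from spilling across two countries.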

Source

2017-08-22 14:34:42 mt1022

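As I said above: this looks right, but I have to step out for a while. I'll check it when I get back. Thanks for the help; I've only just started learning data.table, so it's still somewhat foreign to me. –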
