2013-04-06 133 views

Creating a lag in a data.frame using an adjacent column

I have a large data set that looks like this:

set.seed(1234) 
id <- c(3,3,3,5,5,7) 
amount <- c(24,48,60,84,96,175) 
start <- as.Date(c("2006-01-01","2009-12-09","2010-01-01","2006-04-24", "2009-12-09","2009-05-01")) 
end <- as.Date(c("2010-01-01","2010-01-01","2010-01-01","2009-12-09","2009-12-09", "2009-05-01"))    
noise <-rnorm(6) 
test <- data.frame(id,amount,start,end,noise)    

    id amount      start        end      noise
     3     24 2006-01-01 2010-01-01  0.4978505
     3     48 2009-12-09 2010-01-01 -1.9666172
     3     60 2010-01-01 2010-01-01  0.7013559
     5     84 2006-04-24 2009-12-09 -0.4727914
     5     96 2009-12-09 2009-12-09 -1.0678237
     7    175 2009-05-01 2009-05-01 -0.2179749

But it needs to look like this:

    id amount      start        end      noise switch
     3     24 2006-01-01 2009-12-09  0.4978505      0
     3     48 2009-12-09 2010-01-01 -1.9666172      1
     3     60 2010-01-01 2010-01-01  0.7013559      2
     5     84 2006-04-24 2009-12-09 -0.4727914      0
     5     96 2009-12-09 2009-12-09 -1.0678237      1
     7    175 2009-05-01 2009-05-01 -0.2179749      0

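That is, I would like to lag the values of start and use them to replace the values of end, by id. Second, I would like to create a new variable called 'switch' that counts the number of times 'amount' changes within each id, with the first observation of each id set to 0. I have tried using ts() to create the lag. It does what I want in principle, but it produces a ts object rather than a Date: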

    out <- cbind(as.ts(test$start),lag(test$start))
    colnames(out) <- c("start","end")
    cbind(as.ts(test$start),lag(test$start))

     as.ts(test$start) lag(test$start) 
      NA   13149 
      13149   14587 
      14587   14610 
      14610   13262 
      13262   14587 
      14587   14365 
      14365    NA 
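The offset in the output above comes from how `stats::lag` works: it leaves the values untouched and only moves the time index of the series, and `cbind` on ts objects then aligns the two series by time. A minimal sketch on a toy series (the names `x`, `shifted`, and `both` are illustrative, not from the original post):

```r
# Toy series with time index 1, 2, 3.
x <- ts(c(10, 20, 30))

# stats::lag() does not move the values; it shifts the time index
# back by k steps, so `shifted` has time index 0, 1, 2.
shifted <- stats::lag(x, k = 1)

# cbind() on ts objects aligns by time, so the union of both time
# ranges (0 to 3) is used and the gaps are padded with NA.
both <- cbind(x, shifted)
both
```

This is why six input values produce seven rows in the question, and why the result is a ts object holding the underlying day counts rather than Dates.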

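So the lag(test$start) column is what my end column should look like, but it needs to be applied by the id variable. So I tried wrapping it in a function and applying it by id: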

    # make it a function
    lagfun <- function(x){
      cbind(as.ts(x),lag(x))
    }

    y <- unlist(tapply(start,id,lagfun))

And this is where things get really ugly. Is there a better way to approach this?

Answer


If you put your time series in a data.table, you can accomplish this in a single statement:

testDT[ , c("end", "switch") :=
          list(c(tail(start, -1), tail(end, 1)), cumsum(c(0, diff(amount) != 0)))
        , by=id]

Here is the breakdown:

# Create your data.table object.
library(data.table)
testDT <- data.table(test)


# Replace `end` with the next row's `start` (start without its first value),
# keeping the group's final `end` for the last row.
# Do this `by=id`.
testDT[, end := c(tail(start, -1), tail(end, 1)), by=id]

# Flag each row whose amount differs from the previous amount value,
# start this vector at 0, and take the cumulative sum.
# Also do this by id.
testDT[, switch := cumsum(c(0, diff(amount) != 0)), by=id]

# this is the final result. 
testDT 
   id amount      start        end      noise switch
1:  3     24 2006-01-01 2009-12-09 -1.2070657      0
2:  3     48 2009-12-09 2010-01-01  0.2774292      1
3:  3     60 2010-01-01 2010-01-01  1.0844412      2
4:  5     84 2006-04-24 2009-12-09 -2.3456977      0
5:  5     96 2009-12-09 2009-12-09  0.4291247      1
6:  7    175 2009-05-01 2009-05-01  0.5060559      0
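For comparison, the same two steps can be sketched in base R without data.table. This is not part of the original answer; it assumes the rows are already sorted by id and then by start, and the helper name `has_next` is made up for illustration. The `noise` column is left out because it plays no role in the transformation.

```r
# Rebuild the example data from the question (without the noise column).
id     <- c(3, 3, 3, 5, 5, 7)
amount <- c(24, 48, 60, 84, 96, 175)
start  <- as.Date(c("2006-01-01", "2009-12-09", "2010-01-01",
                    "2006-04-24", "2009-12-09", "2009-05-01"))
end    <- as.Date(c("2010-01-01", "2010-01-01", "2010-01-01",
                    "2009-12-09", "2009-12-09", "2009-05-01"))
test <- data.frame(id, amount, start, end)

# Assumption: rows are sorted by id, then by start.
n <- nrow(test)

# TRUE for every row whose following row belongs to the same id.
has_next <- c(test$id[-n] == test$id[-1], FALSE)

# Copy the next row's start into end; the last row of each id keeps
# its original end. Indexing keeps the Date class intact.
test$end[has_next] <- test$start[which(has_next) + 1]

# Per id: running count of how often amount has changed, starting at 0.
test$switch <- ave(test$amount, test$id,
                   FUN = function(a) cumsum(c(0, diff(a) != 0)))

test
```

The index-based assignment avoids ts objects entirely, so `end` stays a Date column. The data.table version is shorter and does not depend on this manual bookkeeping of group boundaries.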

I should always reach for data.table. Thanks, you've made my day! – kpeyton 2013-04-06 06:09:05


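Well, I'm glad I could make your day with such a small investment of time :) data.table is amazing, especially 'by=', which I find ends up saving countless minutes of coding time. – 2013-04-06 06:11:35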


Minutes? Hours, days! For beginners like me, there is a nice vignette [here](http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf) – kpeyton 2013-04-06 06:32:07