2013-03-30

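Scalable carry-forward in R to create a daily time series

I am trying to create a daily time series dataset from one that is currently observed only periodically. I can perform the desired operation for a single case, but I cannot work out how to scale it to the whole dataset. For example: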

UNIT <- c(100, 100, 200, 200, 200, 200, 200, 300, 300, 300, 300)
STATUS <- c('ACTIVE', 'INACTIVE', 'ACTIVE', 'ACTIVE', 'INACTIVE', 'ACTIVE', 'INACTIVE', 'ACTIVE', 'ACTIVE',
            'ACTIVE', 'INACTIVE')
TERMINATED <- as.Date(c('1999-07-06', '2008-12-05', '2000-08-18', '2000-08-18', '2000-08-18', '2008-08-18',
                        '2008-08-18', '2006-09-19', '2006-09-19', '2006-09-19', '1999-03-15'))
START <- as.Date(c('2007-04-23', '2008-12-06', '2004-06-01', '2007-02-01', '2008-04-19', '2010-11-29', '2010-12-30',
                   '2007-10-29', '2008-02-05', '2008-06-30', '2009-02-07'))
STOP <- as.Date(c('2008-12-05', '2012-12-31', '2007-01-31', '2008-04-18', '2010-11-28', '2010-12-29', '2012-12-31',
                  '2008-02-04', '2008-06-29', '2009-02-06', '2012-12-31'))
TEST <- data.frame(UNIT, STATUS, TERMINATED, START, STOP)
TEST

This gives units observed over time intervals:

   UNIT   STATUS TERMINATED      START       STOP
1   100   ACTIVE 1999-07-06 2007-04-23 2008-12-05
2   100 INACTIVE 2008-12-05 2008-12-06 2012-12-31
3   200   ACTIVE 2000-08-18 2004-06-01 2007-01-31
4   200   ACTIVE 2000-08-18 2007-02-01 2008-04-18
5   200 INACTIVE 2000-08-18 2008-04-19 2010-11-28
6   200   ACTIVE 2008-08-18 2010-11-29 2010-12-29
7   200 INACTIVE 2008-08-18 2010-12-30 2012-12-31
8   300   ACTIVE 2006-09-19 2007-10-29 2008-02-04
9   300   ACTIVE 2006-09-19 2008-02-05 2008-06-29
10  300   ACTIVE 2006-09-19 2008-06-30 2009-02-06
11  300 INACTIVE 1999-03-15 2009-02-07 2012-12-31

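I would like to take each unit and repeat the values of STATUS and TERMINATED (along with N other covariates in the larger dataset) daily, over the whole range from the START date to the STOP date. Doing this for a single record: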

library(zoo) # provides na.locf

A <- seq(TEST$START[1], TEST$STOP[1], "days") # vector of relevant date sequences

# keeping the old data, now with daily date "fill"
B <- matrix(NA, length(A), dim(TEST[-c(4,5)])[2])
C <- data.frame(A, B)

# carry forward observations on covariates through date range
TEST[-c(4,5)][1,] # note TERMINATED has the proper date status:
#   UNIT STATUS TERMINATED
# 1  100 ACTIVE 1999-07-06

# now TERMINATED loses its 'date' status for some reason
C[-c(1)][1,] <- TEST[-c(4,5)][1,]
D <- na.locf(C)
colnames(D)[2:4] <- colnames(TEST)[1:3]
colnames(D)[1] <- "DATE"
head(D)

#         DATE UNIT STATUS TERMINATED
# 1 2007-04-23  100      1      10778
# 2 2007-04-24  100      1      10778
# 3 2007-04-25  100      1      10778
# 4 2007-04-26  100      1      10778
# 5 2007-04-27  100      1      10778
# 6 2007-04-28  100      1      10778

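The observations from the first row are copied across the range from START to STOP, creating a new vector: a daily time series over the whole period. I would like to do the same for row 2, bind the result to D, and so on, so that the data can be analysed by UNIT. In an unsuccessful attempt to generalize, I wrote a loop around na.locf: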

for(i in 1:nrow(TEST)){
    for(j in 0:nrow(TEST)-1) {
        A <- seq(TEST$START[i], TEST$STOP[i], "days")

        B <- matrix(NA, length(A), dim(TEST[-c(4,5)])[2])
        C <- data.frame(A,B)

        C[-c(1)][1,] <- TEST[-c(4,5)][i,]
        assign(paste("D",i, sep=""),na.locf(C))

        #below here the code does not work. R does not recognize i and j as I intend
        #I haven't been able to overcome this using assign, evaluate etc.
        colnames(Di)[2:4] <-colnames(TEST)[1:3]
        colnames(Di)[1] <- "DATE"

        D0 <- matrix(NA, 1, dim(Di)[2])
        assign(paste("D", j, sep = ""),Dj)
        rbind(Di,Dj)

    }
}

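The most obvious problem with the single-record "solution" is the handling of the TERMINATED date: it loses its Date status just before na.locf is applied.

I am hoping there is a better way to look at this, and that I have simply buried myself in needless complication out of ignorance.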

Answers


This is relatively easy to do in SQL, so you can use sqldf, which treats data.frames as SQL tables.

dates <- data.frame(date = seq.Date(min(TEST$START), max(TEST$STOP), by = 1)) 
library(sqldf) 
result <- sqldf(" 
    SELECT * 
    FROM TEST, dates 
    WHERE START <= date AND date <= STOP 
") 
head(result) 
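
For comparison, the same expansion can also be sketched in base R without SQL. This is only a minimal illustration and not part of the original answer: it uses a two-row stand-in for TEST, and the names n_days, idx, and daily are made up for the example. Because rows are selected by index rather than assigned into an NA-filled matrix, TERMINATED should keep its Date class.

```r
# Two-row stand-in for the TEST data frame from the question.
TEST <- data.frame(
  UNIT       = c(100, 100),
  STATUS     = c("ACTIVE", "INACTIVE"),
  TERMINATED = as.Date(c("1999-07-06", "2008-12-05")),
  START      = as.Date(c("2007-04-23", "2008-12-06")),
  STOP       = as.Date(c("2008-12-05", "2012-12-31")),
  stringsAsFactors = FALSE
)

# Number of calendar days covered by each record (both endpoints included).
n_days <- as.integer(TEST$STOP - TEST$START) + 1L

# Repeat each row index once per day, then attach the matching dates.
idx <- rep(seq_len(nrow(TEST)), times = n_days)
daily <- TEST[idx, c("UNIT", "STATUS", "TERMINATED")]
daily$DATE <- TEST$START[idx] + (sequence(n_days) - 1L)
rownames(daily) <- NULL

head(daily)
class(daily$TERMINATED)  # "Date"
```

This builds the whole result in memory at once, so it has the same size limits as the sqldf version.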

If the data is large, it may be worth storing it in a database and doing the computations there.

# With SQLite, a database is just a file 
library(RSQLite) 
connection <- dbConnect(SQLite(), "/tmp/test.db") 

# Copy the data.frames to the "Test" and "Dates" tables.
# When transferring data across systems, it is often easier
# to convert dates to strings.
convert_dates <- function(d) {
    as.data.frame(lapply(
        d,
        function(u) if("Date" %in% class(u)) as.character(u) else u
    ))
}
dbWriteTable(connection, "Test", convert_dates(TEST), row.names = FALSE) 
dbWriteTable(connection, "Dates", convert_dates(dates), row.names = FALSE) 

# Check how many rows the query has: it could be 
# that the result does not fit in memory 
dbGetQuery(connection, " 
    SELECT COUNT(*) 
    FROM Test, Dates 
    WHERE start <= date AND date <= stop 
") 

# If it is reasonable, retrieve all the data 
dbGetQuery(connection, " 
    SELECT * 
    FROM Test, Dates 
    WHERE start <= date AND date <= stop 
") 

# If not, only retrieve what you need 
dbGetQuery(connection, " 
    SELECT * 
    FROM Test, Dates 
    WHERE start <= date AND date <= stop 
    AND '2013-04-01' <= date AND date <= '2013-04-30' 
") 
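
Since the dates were written to the database as strings, they come back from the queries as character columns. A minimal sketch of converting them back follows; restore_dates is a hypothetical helper (not part of RSQLite), and the small result data frame is only a stand-in for what dbGetQuery would return.

```r
# Stand-in for a data frame returned by dbGetQuery(): dates arrive as strings.
result <- data.frame(
  UNIT  = c(100, 100),
  START = c("2007-04-23", "2008-12-06"),
  STOP  = c("2008-12-05", "2012-12-31"),
  date  = c("2007-04-23", "2008-12-06"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert the named character columns back to Date.
restore_dates <- function(d, columns) {
  d[columns] <- lapply(d[columns], as.Date)
  d
}

result <- restore_dates(result, c("START", "STOP", "date"))
str(result)
```

The ISO "YYYY-MM-DD" strings also compare correctly as text, which is why the date filters in the queries above work on the string columns.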

Great, thank you! This is very useful for data management.

Any tips on using this package with large data frames? I am currently getting the "cannot allocate vector of size N" error.

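If the data is too large, you can do everything in the database: I have updated my answer accordingly. (However, the size of the data may be explained by some date far in the future, e.g., "4712-12-31" in your example.)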
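
One way to check the point about far-future dates is to count how many daily rows the expansion would produce before running it. The following is a rough sketch under stated assumptions: rows_needed and the cutoff date are illustrative choices, and the two-row data frame is a stand-in with one placeholder end date.

```r
# Stand-in data: the second record uses a far-future placeholder end date.
TEST <- data.frame(
  UNIT  = c(100, 200),
  START = as.Date(c("2007-04-23", "2010-12-30")),
  STOP  = as.Date(c("2008-12-05", "4712-12-31"))
)

# Rows the daily expansion would produce, endpoints included.
rows_needed <- function(start, stop) sum(as.numeric(stop - start) + 1)

rows_needed(TEST$START, TEST$STOP)

# Assumed cutoff: treat anything after this date as "still open".
cutoff <- as.Date("2012-12-31")
capped_stop <- pmin(TEST$STOP, cutoff)

rows_needed(TEST$START, capped_stop)
```

If the capped count is far smaller than the uncapped one, placeholder end dates are the likely cause of the memory error.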