2015-06-19

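Get a variable's value from one dataset if it falls within the range defined by two variables in another dataset, in R

I have a question about date manipulation in R. I have searched for several days but could not find any help online. I have one dataset with an id and two dates, and another dataset with the same id variable, a date, and a price. For example: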

x = data.frame(id = c("A","B","C","C"),
               date1 = c("29/05/2013", "23/08/2011", "25/09/2011", "18/11/2011"),
               date2 = c("10/07/2013", "04/10/2011", "10/11/2011", "15/12/2011"))

> x
  id      date1      date2
1  A 29/05/2013 10/07/2013
2  B 23/08/2011 04/10/2011
3  C 25/09/2011 10/11/2011
4  C 18/11/2011 15/12/2011
 

 
y = data.frame(id = c("A","A","A","B","B","B","B","B","B","C","C","C"),
               date = c("21/02/2013", "19/06/2013", "31/07/2013", "07/10/2011",
                        "16/01/2012", "10/07/2012", "20/09/2012", "29/11/2012",
                        "15/08/2014", "27/09/2011", "27/01/2012", "09/03/2012"),
               price = c(126, 109, 111, 14, 13.8, 14.1, 14, 14.4, 143, 102, 114, 116))

> y
   id       date price
1   A 21/02/2013 126.0
2   A 19/06/2013 109.0
3   A 31/07/2013 111.0
4   B 07/10/2011  14.0
5   B 16/01/2012  13.8
6   B 10/07/2012  14.1
7   B 20/09/2012  14.0
8   B 29/11/2012  14.4
9   B 15/08/2014 143.0
10  C 27/09/2011 102.0
11  C 27/01/2012 114.0
12  C 09/03/2012 116.0

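What I want to do is take the two dates in dataset x and, for the same id, check whether a date in dataset y falls within the range they define. If it does, I want to pick up the price for that id and date. If there is no such date, the value should be missing. So basically I would like to end up with a final dataset like this: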

final = data.frame(id = c("A","B","C","C"),
                   date1 = c("29/05/2013", "23/08/2011", "25/09/2011", "18/11/2011"),
                   date2 = c("10/07/2013", "04/10/2011", "10/11/2011", "15/12/2011"),
                   date = c("19/06/2013", NA, "27/09/2011", NA),
                   price = c(109, NA, 102, NA))

> final
  id      date1      date2       date price
1  A 29/05/2013 10/07/2013 19/06/2013   109
2  B 23/08/2011 04/10/2011 20/09/2012    14
3  C 25/09/2011 10/11/2011 27/09/2011   102
4  C 18/11/2011 15/12/2011       <NA>    NA

Any help would be greatly appreciated.

See also: Roll join with start/end window (http://stackoverflow.com/questions/24480031/roll-join-with-start-end-window) – MrFlick

Answers

2

Here is a solution based on the excellent foverlaps function from the data.table package.

library(data.table)
## coerce the character dates to Date
setDT(x)[, c("date1", "date2") := list(as.Date(date1, "%d/%m/%Y"),
                                       as.Date(date2, "%d/%m/%Y"))]
## add a dummy date column, since foverlaps looks for start and end columns
setDT(y)[, c("date1") := as.Date(date, "%d/%m/%Y")][, date := date1]
## y must be keyed
setkey(y, id, date, date1)
foverlaps(x, y, by.x = c("id", "date1", "date2"))[,
          list(id, i.date1, date2, date, price)]

   id    i.date1      date2       date price
1:  A 2013-05-29 2013-07-10 2013-06-19   109
2:  B 2011-08-23 2011-10-04       <NA>    NA
3:  C 2011-09-25 2011-11-10 2011-09-27   102
4:  C 2011-11-18 2011-12-15       <NA>    NA

PS: The result is not exactly the same as yours, because there is an error in your expected output.
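
As a possible alternative to foverlaps, the same lookup can be sketched as a data.table non-equi join. This is not part of the answer above. It assumes a data.table version that supports non-equi joins (1.9.8 or later), and it rebuilds the question's data with the dates already parsed. The range is treated as inclusive at both ends, which is an assumption about what the question intends.

```r
library(data.table)

# Sample data from the question, with dates already parsed to class Date
x <- data.table(
  id    = c("A", "B", "C", "C"),
  date1 = as.Date(c("29/05/2013", "23/08/2011", "25/09/2011", "18/11/2011"), "%d/%m/%Y"),
  date2 = as.Date(c("10/07/2013", "04/10/2011", "10/11/2011", "15/12/2011"), "%d/%m/%Y")
)
y <- data.table(
  id    = c("A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C"),
  date  = as.Date(c("21/02/2013", "19/06/2013", "31/07/2013", "07/10/2011",
                    "16/01/2012", "10/07/2012", "20/09/2012", "29/11/2012",
                    "15/08/2014", "27/09/2011", "27/01/2012", "09/03/2012"), "%d/%m/%Y"),
  price = c(126, 109, 111, 14, 13.8, 14.1, 14, 14.4, 143, 102, 114, 116)
)

# For each row of x, look up the rows of y with the same id and
# date1 <= date <= date2 (inclusive bounds are an assumption).
# x.date is y's own date column; i.date1 and i.date2 are the bounds from x.
# Rows of x without a match are kept, with NA for date and price.
res <- y[x, on = .(id, date >= date1, date <= date2),
         .(id, date1 = i.date1, date2 = i.date2, date = x.date, price)]
res
```

The explicit x. and i. prefixes matter here: in a non-equi join the join columns of the result otherwise carry the values from the range bounds rather than the matched date.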

1

I would do this in two steps. First, join the two data frames by id (see this link for more details on joins), as follows:

df <- merge(x, y, by = "id") 

Now you should have one combined dataset, with more entries than you asked for. To narrow it down to your criteria, try:

library(dplyr)  # filter() is dplyr's; the date columns must already be of class Date
df <- filter(df, date > date1, date < date2)

I believe that should work.

Edit: If you really want NA values there rather than just dropping those rows, it gets a bit hairier. In that case, instead of the filter step, I would try this:

# rows whose date falls outside the date1..date2 range (columns must be of class Date)
outside <- df$date < df$date1 | df$date > df$date2
df$price[outside] <- NA
df$date[outside] <- NA
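
The merge-then-filter idea above can be sketched end to end in base R. This is a minimal sketch under two assumptions: the dates are parsed to class Date first (comparing the dd/mm/yyyy strings directly would not order them chronologically), and the range is inclusive at both ends. The names m, hits and res are illustrative only.

```r
# Sample data from the question, with dates parsed to class Date up front
x <- data.frame(
  id    = c("A", "B", "C", "C"),
  date1 = as.Date(c("29/05/2013", "23/08/2011", "25/09/2011", "18/11/2011"), "%d/%m/%Y"),
  date2 = as.Date(c("10/07/2013", "04/10/2011", "10/11/2011", "15/12/2011"), "%d/%m/%Y")
)
y <- data.frame(
  id    = c("A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C"),
  date  = as.Date(c("21/02/2013", "19/06/2013", "31/07/2013", "07/10/2011",
                    "16/01/2012", "10/07/2012", "20/09/2012", "29/11/2012",
                    "15/08/2014", "27/09/2011", "27/01/2012", "09/03/2012"), "%d/%m/%Y"),
  price = c(126, 109, 111, 14, 13.8, 14.1, 14, 14.4, 143, 102, 114, 116)
)

# Step 1: pair every row of x with every row of y that shares its id
m <- merge(x, y, by = "id")

# Step 2: keep only the pairs whose date lies inside the range
# (inclusive bounds are an assumption)
hits <- m[m$date >= m$date1 & m$date <= m$date2, ]

# Step 3: left-join the hits back onto x, so ranges without a match get NA
res <- merge(x, hits, by = c("id", "date1", "date2"), all.x = TRUE)
res
```

Joining back onto x in the last step is what keeps the unmatched ranges as NA rows instead of dropping them.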
1

Or with lubridate and base R:

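library(lubridate)
m <- merge(x, y, by = 'id')
d_range <- m$date1 %--% m$date2          # one interval per merged row
m2 <- m[m$date %within% d_range, ]       # keep rows whose date lies inside its interval
res <- merge(x, m2, by = c('id', 'date1', 'date2'), all.x = TRUE)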

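As @Isaac suggested, merging helps make the process faster. The %--% operator from the lubridate package creates an interval. The %within% operator tests whether the object on the left-hand side lies within the interval on the right-hand side.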

  id      date1      date2       date price
1  A 2013-05-29 2013-07-10 2013-06-19   109
2  B 2011-08-23 2011-10-04       <NA>    NA
3  C 2011-09-25 2011-11-10 2011-09-27   102
4  C 2011-11-18 2011-12-15       <NA>    NA
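
To see the two lubridate operators in isolation, here is a small sketch that uses only id A's range and dates from the question. The variable names rng, dates and inside are illustrative only.

```r
library(lubridate)

# The date1..date2 range for id "A", built as a lubridate interval
rng <- dmy("29/05/2013") %--% dmy("10/07/2013")

# The three dates recorded for id "A" in y
dates <- dmy(c("21/02/2013", "19/06/2013", "31/07/2013"))

# One logical per date: TRUE where the date lies inside the interval
inside <- dates %within% rng
inside
```

Only the dates flagged TRUE would survive the subsetting step in the answer above.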

Data

x = data.frame(id = c("A","B","C","C"), 
       date1 = c("29/05/2013", "23/08/2011", "25/09/2011", "18/11/2011"),  
       date2 = c("10/07/2013", "04/10/2011", "10/11/2011", "15/12/2011")) 

y = data.frame(id = c("A","A","A","B","B","B","B","B","B","C","C","C"), 
       date = c("21/02/2013", "19/06/2013", "31/07/2013", "07/10/2011", "16/01/2012", "10/07/2012","20/09/2012", "29/11/2012",  "15/08/2014", "27/09/2011", "27/01/2012", "09/03/2012"), 
       price = c(126,109,111,14,13.8,14.1,14, 14.4,143,102,114,116)) 

x[c('date1', 'date2')] <- lapply(x[c('date1', 'date2')], dmy) 
y['date'] <- dmy(y[,'date']) 