2015-04-19 55 views
5

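I have two datasets (df1 and df2), both made up of time-formatted values. I want to produce the "objective output" shown below. When merging the two datasets by c("id1","id2"), I want to keep NA wherever the times do not overlap. How can I merge time-frame data and leave NA for the non-overlapping parts?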

df1

id1  id2  click_timing
1    11   2015-02-03 01:00:00
1    11   2015-02-03 02:00:00
1    12   2015-02-03 03:00:00
1    12   2015-02-03 04:00:00
1    13   2015-02-03 05:10:00
2    34   2015-02-03 03:00:00
2    34   2015-02-03 04:00:00
2    36   2015-02-03 01:00:00
...

df2

id1  id2  start                end
1    11   2015-02-03 00:20:00  2015-02-03 00:40:00
1    11   2015-02-03 00:50:00  2015-02-03 01:20:00
1    13   2015-02-03 01:10:00  2015-02-03 01:40:00
1    13   2015-02-03 04:50:00  2015-02-03 05:30:00
2    34   2015-02-03 03:50:00  2015-02-03 04:10:00
...

Objective output

id1  id2  click_timing         start                end
1    11   NA                   2015-02-03 00:20:00  2015-02-03 00:40:00
1    11   2015-02-03 01:00:00  2015-02-03 00:50:00  2015-02-03 01:20:00
1    11   2015-02-03 02:00:00  NA                   NA
1    12   2015-02-03 03:00:00  NA                   NA
1    12   2015-02-03 04:00:00  NA                   NA
1    13   NA                   2015-02-03 01:10:00  2015-02-03 01:40:00
1    13   2015-02-03 05:10:00  2015-02-03 04:50:00  2015-02-03 05:30:00
2    34   2015-02-03 03:00:00  NA                   NA
2    34   2015-02-03 04:00:00  2015-02-03 03:50:00  2015-02-03 04:10:00
2    36   2015-02-03 01:00:00  NA                   NA
...

I tried merge(df1, df2, by = c("id1", "id2")), varying all.x = T and all.y = T. I don't know why it doesn't work; I want unmatched values to be left as NA. –
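A likely reason the plain merge falls short is that merge() pairs every df1 row with every df2 row sharing the same (id1, id2), regardless of whether the click falls inside the interval. The sketch below shows this on a two-row subset of the data above; the name `plain` and the `tz = "UTC"` setting are assumptions added for illustration.

```r
# Two clicks and two intervals for the same (id1, id2) pair, taken from the question.
df1 <- data.frame(
  id1 = c(1, 1), id2 = c(11, 11),
  click_timing = as.POSIXct(c("2015-02-03 01:00:00", "2015-02-03 02:00:00"), tz = "UTC")
)
df2 <- data.frame(
  id1 = c(1, 1), id2 = c(11, 11),
  start = as.POSIXct(c("2015-02-03 00:20:00", "2015-02-03 00:50:00"), tz = "UTC"),
  end   = as.POSIXct(c("2015-02-03 00:40:00", "2015-02-03 01:20:00"), tz = "UTC")
)

# Joining on the ids alone ignores the time columns entirely.
plain <- merge(df1, df2, by = c("id1", "id2"), all = TRUE)

nrow(plain)   # every click is paired with every interval of the same ids
anyNA(plain)  # no NA appears, because every key has a partner on the other side
```

Since all keys match, all.x and all.y have nothing to fill in here, so the time containment has to be tested separately.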

Answers

1

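Tricky problem! I think you have to loop through all the click_timing values manually, compute which time period (start to end) each individual click_timing value falls into, and then merge using the resulting row index as an additional join field: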

df1 <- data.frame(id1=c(1,1,1,1,1,2,2,2), id2=c(11,11,12,12,13,34,34,36), click_timing=as.POSIXct(c('2015-02-03 01:00:00','2015-02-03 02:00:00','2015-02-03 03:00:00','2015-02-03 04:00:00','2015-02-03 05:10:00','2015-02-03 03:00:00','2015-02-03 04:00:00','2015-02-03 01:00:00'))); 
df2 <- data.frame(id1=c(1,1,1,1,2), id2=c(11,11,13,13,34), start=as.POSIXct(c('2015-02-03 00:20:00','2015-02-03 00:50:00','2015-02-03 01:10:00','2015-02-03 04:50:00','2015-02-03 03:50:00')), end=as.POSIXct(c('2015-02-03 00:40:00','2015-02-03 01:20:00','2015-02-03 01:40:00','2015-02-03 05:30:00','2015-02-03 04:10:00'))); 
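# For each click, find the first df2 row with the same ids whose [start, end] contains it (NA if none)
m <- sapply(seq_len(nrow(df1)), function(i) which(df1$id1[i]==df2$id1 & df1$id2[i] == df2$id2 & df1$click_timing[i]>=df2$start & df1$click_timing[i]<=df2$end)[1]); 
# Join on the ids plus the matched row index m; all=TRUE keeps unmatched rows from both sides, [-3] drops m
merge(cbind(df1,m=m),cbind(df2,m=seq_len(nrow(df2))),by=c('id1','id2','m'),all=TRUE)[-3]; 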
## id1 id2  click_timing    start     end 
## 1 1 11    <NA> 2015-02-03 00:20:00 2015-02-03 00:40:00 
## 2 1 11 2015-02-03 01:00:00 2015-02-03 00:50:00 2015-02-03 01:20:00 
## 3 1 11 2015-02-03 02:00:00    <NA>    <NA> 
## 4 1 12 2015-02-03 04:00:00    <NA>    <NA> 
## 5 1 12 2015-02-03 03:00:00    <NA>    <NA> 
## 6 1 13    <NA> 2015-02-03 01:10:00 2015-02-03 01:40:00 
## 7 1 13 2015-02-03 05:10:00 2015-02-03 04:50:00 2015-02-03 05:30:00 
## 8 2 34 2015-02-03 04:00:00 2015-02-03 03:50:00 2015-02-03 04:10:00 
## 9 2 34 2015-02-03 03:00:00    <NA>    <NA> 
## 10 2 36 2015-02-03 01:00:00    <NA>    <NA> 

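If a single click_timing value can ever intersect more than one start/end pair, this solution picks the match that occurs earliest in df2 (i.e. the one with the lowest row index) over the others.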
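The tie-breaking described above comes from the `[1]` applied to `which()`. Here is a minimal sketch on hypothetical toy data; the `click` and `intervals` objects are invented for illustration and are not part of the question's data, and `tz = "UTC"` is an assumption to keep the example reproducible.

```r
# Hypothetical toy data: one click that falls inside two overlapping intervals.
click <- as.POSIXct("2015-02-03 01:00:00", tz = "UTC")
intervals <- data.frame(
  start = as.POSIXct(c("2015-02-03 00:30:00", "2015-02-03 00:50:00"), tz = "UTC"),
  end   = as.POSIXct(c("2015-02-03 01:10:00", "2015-02-03 01:20:00"), tz = "UTC")
)

# All rows of `intervals` that contain the click (both rows match here).
hits <- which(click >= intervals$start & click <= intervals$end)

# Taking [1] keeps only the lowest row index, as in the sapply() call above.
first_hit <- hits[1]

# With no match, which() returns integer(0), and indexing it with [1] gives NA.
# That NA is what lets unmatched clicks survive the merge with NA start/end.
no_hit <- which(click > intervals$end)[1]
```

If all matching intervals should be kept instead, the index would need to be a list rather than a single value per click, which this approach does not handle.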

1

Recreating the initial data frames, with a little preparation:

library(data.table) 
library(lubridate) 

df1<- fread("id1,id2,click_timing 
1,11,2015-02-03 01:00:00 
1,11,2015-02-03 02:00:00 
1,12,2015-02-03 03:00:00 
1,12,2015-02-03 04:00:00 
1,13,2015-02-03 05:10:00 
2,34,2015-02-03 03:00:00 
2,34,2015-02-03 04:00:00 
2,36,2015-02-03 01:00:00") 

# adding a redundant click_timing2 column to use as the end range for further foverlaps() function 
df1[, click_timing2:= click_timing] 
df1[,c("click_timing", "click_timing2"):= list(parse_date_time(click_timing, "%Y-%m-%d %T"), parse_date_time(click_timing2, "%Y-%m-%d %T"))] 


df2<- fread("id1,id2,start,end 
1,11,2015-02-03 00:20:00,2015-02-03 00:40:00 
1,11,2015-02-03 00:50:00,2015-02-03 01:20:00 
1,13,2015-02-03 01:10:00,2015-02-03 01:40:00 
1,13,2015-02-03 04:50:00,2015-02-03 05:30:00 
2,34,2015-02-03 03:50:00,2015-02-03 04:10:00") 

df2[,c("start","end") := list(parse_date_time(start, "%Y-%m-%d %T"), parse_date_time(end, "%Y-%m-%d %T"))] 
setkey(df2, id1, id2, start, end) 

Solution:

df3<- foverlaps(df1, df2, by.x=c("id1", "id2", "click_timing", "click_timing2"), 
          by.y = c("id1", "id2", "start", "end"), type="within") 
objective_output<- merge(df3, df2, by = c("id1", "id2", "start", "end"), all = T) 
# deleting redundant click_timing2 column 
objective_output[,click_timing2:= NULL] 
# reordering columns 
setcolorder(objective_output, c(1,2,5,3,4)) 
#setting key using all columns and thus reordering all rows 
setkey(objective_output) 
objective_output 
#id1 id2  click_timing    start     end 
# 1: 1 11 2015-02-03 02:00:00    <NA>    <NA> 
# 2: 1 11    <NA> 2015-02-03 00:20:00 2015-02-03 00:40:00 
# 3: 1 11 2015-02-03 01:00:00 2015-02-03 00:50:00 2015-02-03 01:20:00 
# 4: 1 12 2015-02-03 03:00:00    <NA>    <NA> 
# 5: 1 12 2015-02-03 04:00:00    <NA>    <NA> 
# 6: 1 13    <NA> 2015-02-03 01:10:00 2015-02-03 01:40:00 
# 7: 1 13 2015-02-03 05:10:00 2015-02-03 04:50:00 2015-02-03 05:30:00 
# 8: 2 34 2015-02-03 03:00:00    <NA>    <NA> 
# 9: 2 34 2015-02-03 04:00:00 2015-02-03 03:50:00 2015-02-03 04:10:00 
#10: 2 36 2015-02-03 01:00:00    <NA>    <NA>