Rolling join on data.table with duplicate keys

I am trying to understand rolling joins in data.table. The data to reproduce this is given at the end.

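Given a data.table of transactions at an airport, at a given time:

> dt
   t_id airport thisTime
1:    1       a      5.1
2:    3       a      5.1
3:    2       a      6.2

(Note: t_id 1 and 3 have the same airport and time.)

And a lookup table of flights departing from airports: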

> dt_lookup
   f_id airport thisTime
1:    1       a        6
2:    2       a        6
3:    1       b        7
4:    1       c        8
5:    2       d        7
6:    1       d        9
7:    2       e        8

> tables()
     NAME      NROW NCOL MB COLS                  KEY
[1,] dt           3    3  1 t_id,airport,thisTime airport,thisTime
[2,] dt_lookup    7    3  1 f_id,airport,thisTime airport,thisTime

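I would like to match every transaction to all of the next possible flights departing from that airport, to give:

   t_id airport thisTime f_id
      1       a        6    1
      1       a        6    2
      3       a        6    1
      3       a        6    2

So I thought this would work: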

> dt[dt_lookup, nomatch=0,roll=Inf] 
   t_id airport thisTime f_id
1:    3       a        6    1
2:    3       a        6    2

But it has not returned the transaction with t_id == 1.

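The documentation says:

Usually there should be no duplicates in x's key, ...

However, I do have duplicates in my 'x key' (namely airport & thisTime), and I can't quite see what is going on that causes t_id = 1 to be removed from the output.

Can anyone shed some light on why t_id = 1 is not returned, and how I can get the join to work when I have duplicates?

Data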

library(data.table) 
dt <- data.table(t_id = seq(1:3), 
       airport = c("a","a","a"), 
       thisTime = c(5.1,6.2, 5.1), key=c("airport","thisTime")) 

dt_lookup <- data.table(f_id = c(rep(1,4),rep(2,3)), 
         airport = c("a","b","c","d", 
           "a","d","e"), 
         thisTime = c(6,7,8,9, 
           6,7,8), key=c("airport","thisTime"))

Source

2015-08-14 tospig

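The reason that t_id = 1 doesn't show up in the output is that a rolling join takes the row where the key combination occurs last. From the documentation:

Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF).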
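To make the difference concrete, here is a minimal sketch on a toy table (the names x, i_exact and i_roll are made up for illustration and are not part of the question's data). An exact match on a duplicated key returns every duplicate row, whereas a value that has to be rolled forward onto that duplicated key only picks up the last of the duplicates:

```r
library(data.table)

# Toy x table: the key (grp, time) is duplicated for id 1 and id 2
x <- data.table(id = 1:3, grp = "a", time = c(5.1, 5.1, 6.2),
                key = c("grp", "time"))

# One lookup row that matches exactly, one that falls in a gap
i_exact <- data.table(grp = "a", time = 5.1, key = c("grp", "time"))
i_roll  <- data.table(grp = "a", time = 6,   key = c("grp", "time"))

# Exact match: both duplicate rows (id 1 and 2) are returned
exact <- x[i_exact, nomatch = 0]

# Rolled match: time 6 falls between 5.1 and 6.2, so the prevailing
# observation is rolled forward, and only the last duplicate is used
rolled <- x[i_roll, roll = Inf, nomatch = 0]

print(exact)
print(rolled)
```

This mirrors what happens in the question: the lookup time 6 is not an exact match for 5.1, so the join rolls, and only the last row of the duplicated key survives.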

Let's consider a somewhat larger dataset:

> DT
   t_id airport thisTime
1:    1       a      5.1
2:    4       a      5.1
3:    3       a      5.1
4:    2       d      6.2
5:    5       d      6.2
> DT_LU
   f_id airport thisTime
1:    1       a        6
2:    2       a        6
3:    2       a        8
4:    1       b        7
5:    1       c        8
6:    2       d        7
7:    1       d        9

When you perform a rolling join just like in your question:

DT[DT_LU, nomatch=0, roll=Inf]

you get:

   t_id airport thisTime f_id
1:    3       a        6    1
2:    3       a        6    2
3:    3       a        8    2
4:    5       d        7    2
5:    5       d        9    1

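As you can see, for both key combinations (a, 5.1 and d, 6.2) only the last row is used in the joined data.table. Because you use Inf as the roll value, all future values are included in the resulting data.table. When you use:

DT[DT_LU, nomatch=0, roll=1]

a value is rolled forward over a gap of at most 1, so only the first future value is included:

   t_id airport thisTime f_id
1:    3       a        6    1
2:    3       a        6    2
3:    5       d        7    2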
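The effect of the roll value can be checked with a short, self-contained sketch. It rebuilds the two tables as they are printed above (with the third flight from airport a at time 8) and compares the unlimited roll with a roll limited to a gap of 1; the object names roll_inf and roll_one are only illustrative:

```r
library(data.table)

DT <- data.table(t_id = c(1, 4, 2, 3, 5),
                 airport = c("a", "a", "d", "a", "d"),
                 thisTime = c(5.1, 5.1, 6.2, 5.1, 6.2),
                 key = c("airport", "thisTime"))

DT_LU <- data.table(f_id = c(1, 1, 1, 1, 2, 2, 2),
                    airport = c("a", "b", "c", "d", "a", "d", "a"),
                    thisTime = c(6, 7, 8, 9, 6, 7, 8),
                    key = c("airport", "thisTime"))

# Unlimited roll: every later flight picks up the last transaction row
roll_inf <- DT[DT_LU, nomatch = 0, roll = Inf]

# Roll limited to a gap of 1: flights more than 1 unit after the
# transaction (a at 8, d at 9) no longer find a match
roll_one <- DT[DT_LU, nomatch = 0, roll = 1]

print(roll_inf)
print(roll_one)
```

In both results only t_id 3 and t_id 5 appear, because those are the last rows of their duplicated key groups.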

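If you want the f_id's for all combinations of airport & thisTime where DT$thisTime is lower than DT_LU$thisTime, you can achieve that by creating a new variable (or replacing the existing thisTime) with the ceiling function. Here is an example where I create a new variable thisTime2 and then do a normal join with DT_LU: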

DT[, thisTime2 := ceiling(thisTime)] 
setkey(DT, airport, thisTime2)[DT_LU, nomatch=0]

This gives:

   t_id airport thisTime thisTime2 f_id
1:    1       a      5.1         6    1
2:    4       a      5.1         6    1
3:    3       a      5.1         6    1
4:    1       a      5.1         6    2
5:    4       a      5.1         6    2
6:    3       a      5.1         6    2
7:    2       d      6.2         7    2
8:    5       d      6.2         7    2

Applied to the data you provided:

> dt[, thisTime2 := ceiling(thisTime)]
> setkey(dt, airport, thisTime2)[dt_lookup, nomatch=0]

   t_id airport thisTime thisTime2 f_id
1:    1       a      5.1         6    1
2:    3       a      5.1         6    1
3:    1       a      5.1         6    2
4:    3       a      5.1         6    2
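One thing to keep in mind with the ceiling approach is that it is an ordinary equi-join on the rounded-up time, so it only finds flights that depart exactly at the next whole time unit. The sketch below uses two made-up tables (tx and fl, not the question's data) to show that a flight further in the future is not matched:

```r
library(data.table)

# Two transactions with the same (duplicated) key
tx <- data.table(t_id = 1:2, airport = "a", thisTime = c(5.1, 5.1))

# One flight at the next whole time unit (6) and one later flight (8)
fl <- data.table(f_id = 1:2, airport = "a", thisTime = c(6, 8),
                 key = c("airport", "thisTime"))

# Round the transaction time up and join on the rounded value
tx[, thisTime2 := ceiling(thisTime)]
setkey(tx, airport, thisTime2)

res <- tx[fl, nomatch = 0]
print(res)
```

Both transactions are matched to the flight at time 6 (so the duplicates are kept), but the flight at time 8 is dropped because ceiling(5.1) is 6, not 8.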

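If you want to include all future values instead of only the first one, you need a somewhat different approach, for which you will need the i.col functionality (which is not documented yet):

1. First set the key to only the airport columns:

setkey(DT, airport)
setkey(DT_LU, airport)

2. Use the i.col functionality (which is not documented yet) in j to get what you want as follows: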

DT1 <- DT_LU[DT, .(tid = i.t_id, 
        tTime = i.thisTime, 
        fTime = thisTime[i.thisTime < thisTime], 
        fid = f_id[i.thisTime < thisTime]), 
      by=.EACHI]

This gives you:

> DT1
    airport tid tTime fTime fid
 1:       a   1   5.1     6   1
 2:       a   1   5.1     6   2
 3:       a   1   5.1     8   2
 4:       a   4   5.1     6   1
 5:       a   4   5.1     6   2
 6:       a   4   5.1     8   2
 7:       a   3   5.1     6   1
 8:       a   3   5.1     6   2
 9:       a   3   5.1     8   2
10:       d   2   6.2     7   2
11:       d   2   6.2     9   1
12:       d   5   6.2     7   2
13:       d   5   6.2     9   1

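Some explanation: when you join two data.tables that use the same column names, you can refer to the columns of the data.table in i by prefixing the column names with i.. That makes it possible to compare thisTime from DT with thisTime from DT_LU. With by = .EACHI you make sure that all combinations for which the condition holds are included in the resulting data.table.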
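The i. prefix and by = .EACHI can be seen in isolation with a tiny sketch. The tables x and i and their columns k and v are made up for illustration; both tables deliberately share the column name v:

```r
library(data.table)

x <- data.table(k = c("a", "a", "b"), v = c(10, 20, 30), key = "k")
i <- data.table(k = c("a", "b"),      v = c(15, 5),      key = "k")

# For each row of i (by = .EACHI), j sees:
#   v   -> the values of x$v in the rows of x that match this row of i
#   i.v -> the value of i$v in this row of i
res <- x[i, .(i_v = i.v, x_above = v[v > i.v]), by = .EACHI]
print(res)
```

For k = "a" the matching x values are 10 and 20, of which only 20 exceeds i.v = 15; for k = "b" the single x value 30 exceeds i.v = 5.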

Alternatively, you can achieve the same with:

DT2 <- DT_LU[DT, .(airport=i.airport, 
        tid=i.t_id, 
        tTime=i.thisTime, 
        fTime=thisTime[i.thisTime < thisTime], 
        fid=f_id[i.thisTime < thisTime]), 
      allow.cartesian=TRUE]

which gives the same result:

> identical(DT1, DT2) 
[1] TRUE
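As a further alternative, and not part of the approach described above, more recent data.table versions (1.9.8 and later) support non-equi joins through the on argument, which lets the "later than the transaction" condition be part of the join itself. The following is a sketch under that assumption; the x. prefix is used to retrieve the flight's own time and id, and the result name res is illustrative:

```r
library(data.table)

DT <- data.table(t_id = c(1, 4, 2, 3, 5),
                 airport = c("a", "a", "d", "a", "d"),
                 thisTime = c(5.1, 5.1, 6.2, 5.1, 6.2))

DT_LU <- data.table(f_id = c(1, 1, 1, 1, 2, 2, 2),
                    airport = c("a", "b", "c", "d", "a", "d", "a"),
                    thisTime = c(6, 7, 8, 9, 6, 7, 8))

# x = DT_LU (flights), i = DT (transactions).
# In on, the left-hand side refers to x and the right-hand side to i,
# so this reads: same airport, flight time later than transaction time.
res <- DT_LU[DT,
             on = .(airport, thisTime > thisTime),
             .(airport = i.airport,
               tid     = i.t_id,
               tTime   = i.thisTime,
               fTime   = x.thisTime,
               fid     = x.f_id),
             nomatch = 0L, allow.cartesian = TRUE]
print(res)
```

No key is needed here because on names the join columns explicitly; allow.cartesian = TRUE is set because each transaction can match several flights.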

If you only want to include future values within a certain boundary, you can use:

DT1 <- DT_LU[DT, 
      { 
       idx = i.thisTime < thisTime & thisTime - i.thisTime < 2 
       .(tid = i.t_id, 
       tTime = i.thisTime, 
       fTime = thisTime[idx], 
       fid = f_id[idx]) 
       }, 
      by=.EACHI]

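This gives:

> DT1
   airport tid tTime fTime fid
1:       a   1   5.1     6   1
2:       a   1   5.1     6   2
3:       a   4   5.1     6   1
4:       a   4   5.1     6   2
5:       a   3   5.1     6   1
6:       a   3   5.1     6   2
7:       d   2   6.2     7   2
8:       d   5   6.2     7   2

When you compare this with the previous result, you see that rows 3, 6, 9, 11 and 13 have now been removed: those are the flights that depart 2 or more time units after the transaction.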
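The window logic can be cross-checked independently of the join syntax with a plain merge followed by a filter. This is only a verification sketch, not the approach from the answer; it relies on merge() adding the default .x and .y suffixes to the two thisTime columns, and the names pairs and within2 are illustrative:

```r
library(data.table)

DT <- data.table(t_id = c(1, 4, 2, 3, 5),
                 airport = c("a", "a", "d", "a", "d"),
                 thisTime = c(5.1, 5.1, 6.2, 5.1, 6.2))

DT_LU <- data.table(f_id = c(1, 1, 1, 1, 2, 2, 2),
                    airport = c("a", "b", "c", "d", "a", "d", "a"),
                    thisTime = c(6, 7, 8, 9, 6, 7, 8))

# All transaction/flight pairs per airport;
# thisTime.x is the transaction time, thisTime.y the flight time
pairs <- merge(DT, DT_LU, by = "airport", allow.cartesian = TRUE)

# Keep flights after the transaction and less than 2 time units away
within2 <- pairs[thisTime.x < thisTime.y & thisTime.y - thisTime.x < 2]

print(nrow(pairs))    # number of pairs before filtering
print(nrow(within2))  # number of pairs inside the window
```

The difference between the two row counts is the number of pairs that fall outside the window.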

Data:

DT <- data.table(t_id = c(1,4,2,3,5),
                 airport = c("a","a","d","a","d"),
                 thisTime = c(5.1, 5.1, 6.2, 5.1, 6.2),
                 key=c("airport","thisTime"))

# the last flight departs from airport a at time 8, as in the printed DT_LU
DT_LU <- data.table(f_id = c(rep(1,4),rep(2,3)),
                    airport = c("a","b","c","d","a","d","a"),
                    thisTime = c(6,7,8,9,6,7,8),
                    key=c("airport","thisTime"))

Source

2015-08-14 14:00:04 Jaap

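Glad to see this post. I had been trying this for a while. – akrun

Great explanation. "A rolling join takes the row where the key combination occurs last" was the key to my understanding, thanks. – tospig

And your 'ceiling' example works well in this instance, but I would expect that it won't work when the values of 'dt$thisTime2' are more than '1' unit of time away from the 'dt_lookup$thisTime' they are trying to match, so I might have to come up with an alternative? – tospig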
