2013-07-25 82 views
2

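data.table join using two columns from one table and one column from the other

I have two data tables, which we can call weights and values. The weights table has 5 columns, as follows: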

first POSIXct 
late POSIXct 
nodeid integer 
aggid integer 
weight numeric 

The values table has these columns:

nodeid integer 
Date POSIXct 
hour integer 
value decimal 

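The idea is to produce a new table that combines the nodes into aggregate nodes by taking a weighted average of their values, based on the weights. However, the weights change over time, so each value has to be matched to the weight whose first-to-late date range contains its Date. The SQL to do this would look something like this: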

select v.Date, v.hour, w.aggid, sum(v.value*w.weight) as aggvalue 
from values v inner join weights w 
on v.nodeid=w.nodeid and v.date between w.first and w.late 
group by aggid, date, hour 

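I really don't know where to start with this one, given the between logic in the SQL. Is this possible in data.table syntax, or do I need to expand the weights table so that it has one row per day instead of using ranges?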

Here is some example data (sorry that it's so long)...

values<-data.table(nodeid = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 
6L, 6L, 6L, 6L, 6L), Date = c("2013-07-02", "2013-07-02", "2013-07-05", 
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05", 
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05", 
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05", 
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05", 
"2013-07-08", "2013-07-10", "2013-07-02", "2013-07-02", "2013-07-05", 
"2013-07-08", "2013-07-10"), hour = c(1L, 2L, 23L, 2L, 2L, 1L, 
2L, 23L, 2L, 2L, 1L, 2L, 23L, 2L, 2L, 1L, 2L, 23L, 2L, 2L, 1L, 
2L, 23L, 2L, 2L, 1L, 2L, 23L, 2L, 2L), value = c(8.234, 3.218, 
0.787, 8.689, 6.218, 6.89, 1.914, 2.459, 6.683, 8.122, 0.281, 
1.136, 1.993, 7.27, 9.582, 5.777, 1.375, 9.204, 7.862, 0.633, 
2.433, 1.842, 7.178, 10.692, 1.417, 1.259, 2.619, 0.031, 6.744, 
5.941)) 

weights<-data.table(first = c("2013-07-01", "2013-07-01", "2013-07-01", 
"2013-07-01", "2013-07-01", "2013-07-01", "2013-07-08", "2013-07-08", 
"2013-07-08", "2013-07-08", "2013-07-08", "2013-07-08"), late = c("2013-07-07", 
"2013-07-07", "2013-07-07", "2013-07-07", "2013-07-07", "2013-07-07", 
"2013-07-20", "2013-07-20", "2013-07-20", "2013-07-20", "2013-07-20", 
"2013-07-20"), nodeid = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 
4L, 5L, 6L), aggid = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 
2L, 2L), weight = c(0.5, 0.25, 0.25, 0.3, 0.5, 0.2, 0.6, 0.2, 
0.2, 0.4, 0.45, 0.15)) 

exresults<-data.table(aggid = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L), Date = c("2013-07-02", "2013-07-02", "2013-07-02", "2013-07-02", 
"2013-07-05", "2013-07-05", "2013-07-08", "2013-07-08", "2013-07-10", 
"2013-07-10"), hour = c(1L, 1L, 2L, 2L, 23L, 23L, 2L, 2L, 2L, 
2L), aggvalue = c(5.90975, 3.2014, 2.3715, 1.8573, 1.5065, 6.3564, 
8.004, 8.9678, 7.2716, 1.782)) 

Answers

2

Use the roll parameter of a data.table join:

setkey(values, nodeid, Date) 
setkey(weights, nodeid, late) 

weights[values, roll = -Inf][, list(aggvalue = sum(weight*value)), 
           by = list(aggid, Date = late, hour)] 
# aggid  Date hour aggvalue 
# 1:  1 2013-07-02 1 5.90975 
# 2:  1 2013-07-02 2 2.37150 
# 3:  1 2013-07-05 23 1.50650 
# 4:  1 2013-07-08 2 8.00400 
# 5:  1 2013-07-10 2 7.27160 
# 6:  2 2013-07-02 1 3.20140 
# 7:  2 2013-07-02 2 1.85730 
# 8:  2 2013-07-05 23 6.35640 
# 9:  2 2013-07-08 2 8.96780 
#10:  2 2013-07-10 2 1.78200 

Note: I'd be careful about cases where the correct range doesn't exist. I haven't tested that edge case.
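For the untested edge case noted above, one possible alternative is a non-equi join, which checks both ends of the range and can drop values that no range covers. This is only a sketch under assumptions: it needs a data.table version that supports non-equi joins in on= (1.9.8 or later, and 1.12.0 or later for nomatch = NULL), and the small tables w and v below are hypothetical data made up for illustration, with dates converted to Date. It is not the method used in the answer above.

```r
library(data.table)

# Hypothetical miniature tables (not the question's data).
w <- data.table(
  first  = as.Date(c("2013-07-01", "2013-07-01", "2013-07-08", "2013-07-08")),
  late   = as.Date(c("2013-07-07", "2013-07-07", "2013-07-20", "2013-07-20")),
  nodeid = c(1L, 2L, 1L, 2L),
  aggid  = c(1L, 1L, 1L, 1L),
  weight = c(0.5, 0.5, 0.75, 0.25)
)
v <- data.table(
  nodeid = c(1L, 2L, 1L, 2L, 1L),
  Date   = as.Date(c("2013-07-02", "2013-07-02", "2013-07-10",
                     "2013-07-10", "2013-08-01")),
  hour   = c(1L, 1L, 2L, 2L, 3L),
  value  = c(4, 2, 8, 4, 100)
)

# Equality on nodeid, plus first <= Date <= late.
# nomatch = NULL drops rows of v that fall outside every range
# (here the 2013-08-01 row), instead of rolling them to a neighbour.
matched <- w[v, on = .(nodeid, first <= Date, late >= Date), nomatch = NULL,
             .(aggid, Date = i.Date, hour = i.hour, wvalue = weight * value)]

# Sum the weighted values per aggregate node, date and hour.
res <- matched[, .(aggvalue = sum(wvalue)), by = .(aggid, Date, hour)]
print(res)
```

The i. prefix picks the Date and hour columns from v, because after a non-equi join the join columns in the result can otherwise carry values under the names from w.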

+0

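I see you don't need the 'first' column, and I admit I need to do more reading on 'roll', but is there a way to use 'first' instead of 'late'? In my real data source I only have the 'first' date, and I had to create the 'late' column to build the ranges. If there is a syntax that lets me use just the 'first' column, then I can skip creating the 'late' column altogether. Is this possible, or is creating it a necessary step? –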

+1

I changed the key on 'weights' to 'first' instead of 'late', and changed 'roll = -Inf' to 'roll = Inf', and that seems to work. –

+0

@DeanMacGregor Yes, that's it – eddi
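The variant discussed in these comments (key weights on 'first' and use roll = Inf) might look roughly like the following. It is a sketch on hypothetical miniature tables w and v, not on the question's data, and it assumes the dates are stored as Date. With no 'late' column, a value dated before the earliest 'first' has nothing to roll from and gets NA, so it is filtered out explicitly here.

```r
library(data.table)

# Hypothetical weights with only a start date per range (no `late` column).
w <- data.table(
  first  = as.Date(c("2013-07-01", "2013-07-08")),
  nodeid = c(1L, 1L),
  aggid  = c(1L, 1L),
  weight = c(0.5, 0.75)
)
v <- data.table(
  nodeid = c(1L, 1L, 1L),
  Date   = as.Date(c("2013-06-15", "2013-07-02", "2013-07-10")),
  hour   = c(1L, 1L, 2L),
  value  = c(10, 4, 8)
)

setkey(w, nodeid, first)
setkey(v, nodeid, Date)

# roll = Inf carries the latest `first` on or before each Date forward.
# In the result, the `first` column holds the Date values from v.
joined <- w[v, roll = Inf]

# The 2013-06-15 row precedes every `first`, so its weight is NA; drop it.
res <- joined[!is.na(weight), .(aggvalue = sum(weight * value)),
              by = .(aggid, Date = first, hour)]
print(res)
```

Note that without an end date the last weight keeps applying indefinitely, so any upper bound on the final range would have to be enforced separately.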
