2014-05-23

R data.table: comparing dates between two tables and counting records

I have two data.tables, a and b:

library(data.table)

# dput() output; the unparseable .internal.selfref pointer has been dropped
a = structure(list(id = c(86246, 86252, 12262064), brand = c(3718L, 
13474L, 17286L), offerdate = structure(c(15454, 15791, 15883), class = "Date")), .Names = c("id", 
"brand", "offerdate"), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"))

# same as above: .internal.selfref pointer dropped so the call parses
b = structure(list(id = c(86246, 86246, 86246), brand = c(3718, 3718, 
875), date = structure(c(15408, 15430, 15434), class = "Date")), .Names = c("id", 
"brand", "date"), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"))

> a
         id brand  offerdate
1:    86246  3718 2012-04-24
2:    86252 13474 2013-03-27
3: 12262064 17286 2013-06-27
> b
      id brand       date
1: 86246  3718 2012-03-09
2: 86246  3718 2012-03-31
3: 86246   875 2012-04-04

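Now, for each row of a, I want to count the rows of b that have the same id and brand and whose date falls within the 30 days before a.offerdate.

The result I am after is an updated a: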

> a
         id brand  offerdate nbTrans_last_30_days
1:    86246  3718 2012-04-24                    1
2:    86252 13474 2013-03-27                    0
3: 12262064 17286 2013-06-27                    0

I can get this done with subset, but I am looking for a fast solution. The subset version does the following (for each row of a):

subset(b, (id == 86246) & (brand == 3718) & (date > as.Date("2012-03-24"))) 

where the date bound is derived from a.offerdate.
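For reference, the row-by-row approach described above could be sketched roughly as follows. This is only a baseline for checking results, not a fast solution. The helper name count_recent is hypothetical, and the upper bound date <= offerdate is an assumption (the subset call above only applies the lower bound).

```r
library(data.table)

a <- data.table(
  id        = c(86246, 86252, 12262064),
  brand     = c(3718L, 13474L, 17286L),
  offerdate = as.Date(c("2012-04-24", "2013-03-27", "2013-06-27"))
)
b <- data.table(
  id    = c(86246, 86246, 86246),
  brand = c(3718, 3718, 875),
  date  = as.Date(c("2012-03-09", "2012-03-31", "2012-04-04"))
)

# Hypothetical helper: count the rows of b that match row i of a
count_recent <- function(i) {
  cutoff <- a$offerdate[i] - 30            # start of the 30-day window
  sum(b$id == a$id[i] &
        b$brand == a$brand[i] &
        b$date >= cutoff &
        b$date <= a$offerdate[i])          # assumption: ignore later dates
}

# One count per row of a, in the order of a
counts <- vapply(seq_len(nrow(a)), count_recent, integer(1))
counts
```

Each call scans all of b, so the cost grows with nrow(a) * nrow(b), which is why a join-based solution is preferable on real data.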

I managed to count the total number of matching rows in b:

> setkey(a,id, brand) 
> setkey(b,id, brand) 
> # relies on implicit by-without-by (data.table < 1.9.4); newer versions need b[a, .N, by = .EACHI]
> a = a[b[a, .N]] 
> setnames(a, "N", "nbTrans") 
> a 
         id brand  offerdate nbTrans
1:    86246  3718 2012-04-24       2
2:    86252 13474 2013-03-27       0
3: 12262064 17286 2013-06-27       0

but I do not know how to handle the date comparison between the two tables.


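The answer below works on the original small data set, but for some reason not on my real data. I tried to reproduce the problem with two new variables, a2 and b2: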

a2=structure(list(id = c(86246, 86252, 12262064), brand = structure(c(3L, 
9L, 12L), .Label = c("875", "1322", "3718", "4294", "5072", "6732", 
"6926", "7668", "13474", "13791", "15889", "17286", "17311", 
"26189", "26456", "28840", "64486", "93904", "102504"), class = "factor"), 
 offerdate = structure(c(15819, 15791, 15883), class = "Date")), .Names = c("id", 
"brand", "offerdate"), row.names = c(NA, -3L), class = c("data.table", 
"data.frame")) 

b2=structure(list(id = c(86246, 86246, 86246, 86246, 86246, 86246, 
86246, 86246), brand = c(3718L, 3718L, 3718L, 3718L, 3718L, 3718L, 
3718L, 3718L), date = structure(c(15423, 15724, 15752, 15767, 
15782, 15786, 15788, 15811), class = "Date")), .Names = c("id", 
"brand", "date"), sorted = c("id", "brand"), class = c("data.table", 
"data.frame")) 

> setkey(a2,id,brand) 
> setkey(b2,id,brand) 
> merge(a2, b2, all.x = TRUE, allow.cartesian = TRUE) 
         id brand  offerdate date
1:    86246  3718 2013-04-24 <NA>
2:    86252 13474 2013-03-27 <NA>
3: 12262064 17286 2013-06-27 <NA>

The problem is that the merge does not keep the b2.date information.

Answers


The trick is to use merge with the allow.cartesian argument:

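setkey(a, id, brand) 
setkey(b, id, brand) 

# left join on the shared key; allow.cartesian permits many b rows per a row
c <- merge(a, b, all.x = TRUE, allow.cartesian = TRUE) 

# flag transactions dated no more than 30 days before the offer
# (rows dated after offerdate also pass; add `& date <= offerdate` to exclude them)
c[, Trans := (offerdate - date) <= 30] 

# count flagged rows per offer; unmatched rows (NA date) contribute 0
c[, list(nbTrans_last_30_days = sum(Trans, na.rm = TRUE)), 
    keyby = list(id, brand, offerdate)] 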
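
Since the question asks for an updated a, the aggregated counts can be written back into a by reference. The following is a minimal sketch, assuming a data.table version that supports the on= argument (1.9.6 or later) and assuming brand is stored as an integer in both tables. The names m and counts are illustrative, and the extra date <= offerdate condition is an assumption about what "before" should mean.

```r
library(data.table)

a <- data.table(
  id        = c(86246, 86252, 12262064),
  brand     = c(3718L, 13474L, 17286L),
  offerdate = as.Date(c("2012-04-24", "2013-03-27", "2013-06-27"))
)
b <- data.table(
  id    = c(86246, 86246, 86246),
  brand = c(3718L, 3718L, 875L),   # assumption: integer, same class as a$brand
  date  = as.Date(c("2012-03-09", "2012-03-31", "2012-04-04"))
)

# Left join: every row of a is kept, paired with each matching row of b
m <- merge(a, b, by = c("id", "brand"), all.x = TRUE, allow.cartesian = TRUE)

# Count transactions in the 30 days up to and including offerdate;
# rows without a match have date = NA and are dropped by na.rm = TRUE
counts <- m[, list(nbTrans_last_30_days =
                     sum(date <= offerdate & offerdate - date <= 30,
                         na.rm = TRUE)),
            by = list(id, brand, offerdate)]

# Update join: add the count to a in place, keeping a's row order
a[counts, nbTrans_last_30_days := i.nbTrans_last_30_days,
  on = c("id", "brand", "offerdate")]
a
```

The update join avoids reassigning a and leaves its original row order untouched, which is convenient when a carries other columns.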

Thanks. It works on this small example. Unfortunately, when I apply it to my real data set, c.date is all <NA>. – tucson


I updated the question with the data sets a2 and b2, where the merge does not keep the date information. I do not know why. – tucson


@tucson That is because the classes of the variables do not match. See 'sapply(a2, class)' and 'sapply(b2, class)'. In one table 'brand' is a 'factor', in the other it is an 'integer'. – djhurio
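
One possible way to act on this comment is sketched below: convert the factor to integer before merging. The tables are rebuilt with data.table() from the values in the question (only the three factor levels actually used are kept, and the dates are the day counts from the dput output); the names m and res are illustrative.

```r
library(data.table)

a2 <- data.table(
  id        = c(86246, 86252, 12262064),
  brand     = factor(c("3718", "13474", "17286")),
  offerdate = as.Date(c("2013-04-24", "2013-03-27", "2013-06-27"))
)
b2 <- data.table(
  id    = rep(86246, 8),
  brand = rep(3718L, 8),
  date  = as.Date(c(15423, 15724, 15752, 15767, 15782, 15786, 15788, 15811),
                  origin = "1970-01-01")
)

# Inspect the column classes: brand is "factor" in a2 but "integer" in b2
sapply(a2, class)
sapply(b2, class)

# Go through character first: as.integer() applied directly to a factor
# returns the internal level codes, not the brand numbers
a2[, brand := as.integer(as.character(brand))]

m <- merge(a2, b2, by = c("id", "brand"), all.x = TRUE, allow.cartesian = TRUE)

# Flag rows dated within the 30 days up to offerdate (NA dates count as FALSE)
m[, Trans := !is.na(date) & date <= offerdate & (offerdate - date) <= 30]

res <- m[, list(nbTrans_last_30_days = sum(Trans)),
         keyby = list(id, brand, offerdate)]
res
```

Converting b2$brand to a factor instead would also align the classes, but keeping brand as an integer avoids having to keep the factor levels of the two tables in sync.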