How do I join data.tables when one is a lookup table?

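I am having trouble applying a simple data.table join example to a larger (10GB) data set. merge() works fine on data.frames with the larger data set, but I would like to take advantage of the speed of data.table. Can anyone point out what I am misunderstanding about data.table (and the error message in particular)?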

Here is a simple example (taken from this thread: Join of two data.tables fails).

# The data of interest. 
(DT <- data.table(id = c(rep(1154:1155, 2), 1160), 
        price = c(1.99, 2.50, 15.63, 15.00, 0.75), 
        key = "id")) 

    id price 
1: 1154 1.99 
2: 1154 15.63 
3: 1155 2.50 
4: 1155 15.00 
5: 1160 0.75 

# Lookup table. 
(lookup <- data.table(id  = 1153:1160, 
         version = c(1,1,3,4,2,1,1,2), 
         yr  = rep(2006, 8),  # one value per id, so nothing depends on recycling
         key  = "id")) 

    id version yr 
1: 1153  1 2006 
2: 1154  1 2006 
3: 1155  3 2006 
4: 1156  4 2006 
5: 1157  2 2006 
6: 1158  1 2006 
7: 1159  1 2006 
8: 1160  2 2006 

# The desired table. Note: lookup[DT] works as well. 
DT[lookup, allow.cartesian = TRUE, nomatch = 0] 

    id price version yr 
1: 1154 1.99  1 2006 
2: 1154 15.63  1 2006 
3: 1155 2.50  3 2006 
4: 1155 15.00  3 2006 
5: 1160 0.75  2 2006
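When the goal is only to attach lookup columns to each row of the data of interest, an update join is one alternative to DT[lookup]. The following is a minimal sketch, assuming data.table 1.9.6 or later (for the on= argument); it reuses the toy tables above without keys. It adds the columns by reference, so the row count of DT cannot change.

```r
library(data.table)

# Toy data from the example above (no key needed when using on=)
DT <- data.table(id    = c(rep(1154:1155, 2), 1160),
                 price = c(1.99, 2.50, 15.63, 15.00, 0.75))
lookup <- data.table(id      = 1153:1160,
                     version = c(1, 1, 3, 4, 2, 1, 1, 2),
                     yr      = rep(2006, 8))

n_before <- nrow(DT)

# Update join: for each row of DT, copy version and yr from the matching
# row of lookup. The i. prefix refers to columns of lookup.
DT[lookup, on = "id", `:=`(version = i.version, yr = i.yr)]

# DT is modified in place and keeps its original number of rows
stopifnot(nrow(DT) == n_before)
print(DT)
```

If lookup contained several rows for one id, an update join would silently keep only one of them for that id, so it is worth checking the lookup key for duplicates first.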

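The larger data set consists of two data.frames: temp.3561 (the data set of interest) and temp.versions (the lookup data set). They have the same structure as DT and lookup (above), respectively. Using merge() works well, but my application of data.table is clearly flawed: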

# Merge data.frames: works just fine 
long.merged   <- merge(temp.versions, temp.3561, by = "id") 

# Convert the data.frames to data.tables 
DTtemp.3561   <- as.data.table(temp.3561) 
DTtemp.versions  <- as.data.table(temp.versions) 

# Merge the data.tables: doesn't work 
setkey(DTtemp.3561, id) 
setkey(DTtemp.versions, id) 
DTlong.merged  <- merge(DTtemp.versions, DTtemp.3561, by = "id") 

Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), : 
    Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate 
key values in i, each of which join to the same group in x over and over again. If that's ok, 
try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the 
large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. 
Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable- 
help for advice.

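DTtemp.versions has the same structure as lookup (in the simple example), and its key "id" consists of 779,473 unique values (no duplicates).

DTtemp.3561 has the same structure as DT (in the simple example) plus a few other variables, but its key "id" has only 829 unique values despite the 7,946,667 observations (lots of duplicates).

Since I only want to add the version number and year from DTtemp.versions to each observation in DTtemp.3561, the merged data.table should have the same number of observations as DTtemp.3561 (7,946,667). Specifically, I don't understand why merge() generates "excess" observations when using data.table, but not when using data.frame.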
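The expected size of a join can be checked before running it: for every id present in both tables, the result contains (rows with that id in one table) times (rows with that id in the other). The sketch below uses small stand-in tables (big, lk and all values are hypothetical, not the real data) to show that the result only keeps the row count of the large table when each id appears at most once in the lookup.

```r
library(data.table)

# Hypothetical stand-ins: big plays the role of DTtemp.3561,
# lk plays the role of a lookup table that has one duplicated id (id 1)
big <- data.table(id = c(1, 1, 1, 2, 2, 3), price = 1:6)
lk  <- data.table(id = c(1, 1, 2, 4), version = c(1, 2, 3, 4))

# Rows per id on each side
cnt_big <- big[, .(n_big = .N), by = id]
cnt_lk  <- lk[,  .(n_lk  = .N), by = id]

# For ids present in both tables, the join yields n_big * n_lk rows
counts        <- merge(cnt_big, cnt_lk, by = "id")
expected_rows <- counts[, sum(n_big * n_lk)]

# The actual join needs allow.cartesian = TRUE here, because the result
# is larger than max(nrow(big), nrow(lk))
actual_rows <- nrow(merge(big, lk, by = "id", allow.cartesian = TRUE))

stopifnot(expected_rows == actual_rows)
print(counts)
```

Running the same count on the real tables would show which ids account for any rows beyond 7,946,667.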

Likewise:

# Same error message, but with 12,055,777 observations 
altDTlong.merged <- DTtemp.3561[DTtemp.versions] 

# Same error message, but with 11,277,332 observations 
alt2DTlong.merged <- DTtemp.versions[DTtemp.3561]

Including allow.cartesian = T and nomatch = 0 does not drop the "excess" observations. Oddly, if I truncate the data set of interest to 10 observations, merge() works fine on both data.frames and data.tables.

# Merge short DF: works just fine 
short.3561   <- temp.3561[-(11:7946667),] 
short.merged  <- merge(temp.versions, short.3561, by = "id") 

# Merge short DT 
DTshort.3561  <- data.table(short.3561, key = "id") 
DTshort.merged  <- merge(DTtemp.versions, DTshort.3561, by = "id")

I have been through the FAQ (http://datatable.r-forge.r-project.org/datatable-faq.pdf, 1.12 in particular). How would you approach this?

Source

2014-05-04 Pat W.

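@Arun, I had actually read the post you linked to before asking, but did not really grasp the concept. Possibly because merge(DTtemp.versions, DTtemp.3561, by = "id", allow.cartesian = TRUE) also returns 11,227,332 observations (nomatch does not change that). Maybe that is why I am confused about what allow.cartesian is or is not doing.

@Arun, you are right: it is not clear to me a) why merge() does not work with these data.tables (I will try to come up with a larger working example), and b) how to do this with a DT[lookup, ...]-style construct (since allow.cartesian = T and nomatch = 0 do not seem to drop the "excess" observations on the large data set). But merge() does work on the original data.frames, so I am not without recourse.

@Arun, thanks...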

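Can anyone point out what I am misunderstanding about data.table (and the error message in particular)?

To answer you directly, the error message

Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate key values in i...

states that the result of your join has more rows than the usual case would produce. This means the lookup table key has duplicates, which produce multiple matches when joining.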
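One way to test this diagnosis is to look for repeated ids in the lookup table directly. This is a minimal sketch on a hypothetical lookup table lk (not the asker's data); anyDuplicated() and unique() are called with by= so that only the id column is compared. Keeping the first row per id is purely an assumption for illustration; which row is the right one depends on the data.

```r
library(data.table)

# Hypothetical lookup table in which id 1154 appears twice
lk <- data.table(id      = c(1153, 1154, 1154, 1155),
                 version = c(1, 1, 2, 3),
                 yr      = 2006)

# TRUE if at least one id occurs more than once
has_dups <- anyDuplicated(lk, by = "id") > 0

# Which ids are repeated, and how often
dup_ids <- lk[, .N, by = id][N > 1]
print(dup_ids)

# Assumption: keep the first row for each id
lk_unique <- unique(lk, by = "id")

stopifnot(has_dups, anyDuplicated(lk_unique, by = "id") == 0)
```

With a lookup table that is unique on id, joining it onto the large table yields at most one match per row, so the result cannot exceed the row count of the large table.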

If this does not answer your question, you should rephrase it.

Source

2015-05-02 12:11:32 jangorecki
