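I'm having trouble applying a simple data.table join example to a larger (10GB) dataset. merge() works fine on data.frames for the large dataset, but I'd like to take advantage of data.table's speed. Can anyone point out what I'm misunderstanding about data.table (in particular, the error message)? How do I join data.tables when one of them is a lookup table?
Here is a simple example (from this thread: Join of two data.tables fails).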
library(data.table)
# The data of interest.
(DT <- data.table(id = c(rep(1154:1155, 2), 1160),
                  price = c(1.99, 2.50, 15.63, 15.00, 0.75),
                  key = "id"))
     id price
1: 1154  1.99
2: 1154 15.63
3: 1155  2.50
4: 1155 15.00
5: 1160  0.75
# Lookup table.
(lookup <- data.table(id = 1153:1160,
                      version = c(1,1,3,4,2,1,1,2),
                      yr = rep(2006, 8),
                      key = "id"))
     id version   yr
1: 1153       1 2006
2: 1154       1 2006
3: 1155       3 2006
4: 1156       4 2006
5: 1157       2 2006
6: 1158       1 2006
7: 1159       1 2006
8: 1160       2 2006
# The desired table. Note: lookup[DT] works as well.
DT[lookup, allow.cartesian = TRUE, nomatch = 0]
     id price version   yr
1: 1154  1.99       1 2006
2: 1154 15.63       1 2006
3: 1155  2.50       3 2006
4: 1155 15.00       3 2006
5: 1160  0.75       2 2006
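For reference, the lookup direction mentioned in the comment above can be sketched as follows. This is a minimal illustration on the toy tables only, rebuilt without keys and with integer ids so that it runs on its own; the names res and DT2 are made up for the example. It relies on lookup$id being unique: in that case lookup[DT] returns exactly one row per row of DT.

```r
library(data.table)

# Toy tables from the example above (ids made integer, no keys set).
DT <- data.table(id = c(rep(1154:1155, 2), 1160L),
                 price = c(1.99, 2.50, 15.63, 15.00, 0.75))
lookup <- data.table(id = 1153:1160,
                     version = c(1, 1, 3, 4, 2, 1, 1, 2),
                     yr = 2006)

# lookup[DT]: for each row of DT, fetch the matching row of lookup.
# Since lookup$id has no duplicates, the result has nrow(DT) rows.
res <- lookup[DT, on = "id"]

# Alternative: an update join that adds the lookup columns to a copy of DT
# by reference, so the row count cannot change.
DT2 <- copy(DT)
DT2[lookup, on = "id", `:=`(version = i.version, yr = i.yr)]

print(res)
print(DT2)
```

The update-join form may be the safer choice when the goal is only to attach columns, because it never adds rows; if the lookup has duplicate ids it silently keeps one match per row instead, so it does not replace checking the lookup for duplicates.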
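The larger dataset consists of two data.frames: temp.3561 (the dataset of interest) and temp.versions (the lookup dataset). They have the same structure as DT and lookup (above), respectively. merge() works well on them, but my use of data.table is evidently flawed: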
# Merge data.frames: works just fine
long.merged <- merge(temp.versions, temp.3561, by = "id")
# Convert the data.frames to data.tables
DTtemp.3561 <- as.data.table(temp.3561)
DTtemp.versions <- as.data.table(temp.versions)
# Merge the data.tables: doesn't work
setkey(DTtemp.3561, id)
setkey(DTtemp.versions, id)
DTlong.merged <- merge(DTtemp.versions, DTtemp.3561, by = "id")
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate
key values in i, each of which join to the same group in x over and over again. If that's ok,
try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the
large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-
help for advice.
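DTtemp.versions has the same structure as lookup (in the simple example), and its key "id" consists of 779,473 unique values (no duplicates).
DTtemp.3561 has the same structure as DT (in the simple example) plus a few other variables, but its key "id" has only 829 unique values despite 7,946,667 observations (lots of duplicates).
Since I only want to add the version number and year from DTtemp.versions to each observation in DTtemp.3561, the merged data.table should have the same number of observations as DTtemp.3561 (7,946,667). Specifically, I don't understand why merge() produces "excess" observations when used on data.tables but not when used on data.frames.
Similarly: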
# Same error message, but with 12,055,777 observations
altDTlong.merged <- DTtemp.3561[DTtemp.versions]
# Same error message, but with 11,277,332 observations
alt2DTlong.merged <- DTtemp.versions[DTtemp.3561]
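The error message points at duplicate key values, so one way to test the assumption that the lookup ids are unique is to count them directly. The sketch below uses small made-up tables (versions and obs are hypothetical stand-ins for DTtemp.versions and DTtemp.3561, with a deliberately repeated id) and is only an illustration of the check, not a diagnosis of the real data.

```r
library(data.table)

# Hypothetical stand-ins; id 2 appears twice in the lookup on purpose.
versions <- data.table(id = c(1L, 2L, 2L, 3L), version = c(1, 3, 4, 2))
obs <- data.table(id = c(1L, 1L, 2L, 2L, 3L),
                  price = c(1.99, 15.63, 2.50, 15.00, 0.75))

# Which lookup ids occur more than once?
dup_ids <- versions[, .N, by = id][N > 1L]

# How many lookup rows does each observation match?
matches <- versions[obs, on = "id", .N, by = .EACHI]

# Predicted join size versus actual join size.
predicted <- sum(matches$N)
joined <- versions[obs, on = "id"]

# The same join on plain data.frames, for comparison of row counts.
df_rows <- nrow(merge(as.data.frame(versions), as.data.frame(obs), by = "id"))

print(dup_ids)
print(c(predicted = predicted, joined = nrow(joined), data.frame = df_rows))
```

In this toy case all three counts agree and exceed nrow(obs), purely because of the repeated id in the lookup. Comparing nrow() of the data.frame merge result against the data.table row counts on the real data would show whether the two really differ.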
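Including allow.cartesian = T and nomatch = 0 does not drop the "excess" observations. Oddly, if I truncate the dataset of interest to 10 observations, merge() works fine on both the data.frames and the data.tables.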
# Merge short DF: works just fine
short.3561 <- temp.3561[-(11:7946667),]
short.merged <- merge(temp.versions, short.3561, by = "id")
# Merge short DT
DTshort.3561 <- data.table(short.3561, key = "id")
DTshort.merged <- merge(DTtemp.versions, DTshort.3561, by = "id")
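On the allow.cartesian and nomatch point, a small sketch may help separate what each argument does. The tables x, i and i2 are made up for illustration. As far as I understand it, allow.cartesian only lifts the size check that raises the error; it does not change which rows are returned, and nomatch = 0 only drops rows of the inner table that have no match at all.

```r
library(data.table)

# Both tables repeat the same id four times, so the join yields 4 * 4 rows.
x <- data.table(id = rep(1L, 4), a = 1:4)
i <- data.table(id = rep(1L, 4), b = 1:4)

# Without allow.cartesian the join is refused; capture the error message.
err <- tryCatch(x[i, on = "id"], error = function(e) conditionMessage(e))

# With allow.cartesian = TRUE the same 16 rows are simply returned.
full <- x[i, on = "id", allow.cartesian = TRUE]

# nomatch only affects rows of i without any match in x (here id 2).
i2 <- rbind(i, data.table(id = 2L, b = 5L))
with_na <- x[i2, on = "id", allow.cartesian = TRUE]
no_na <- x[i2, on = "id", allow.cartesian = TRUE, nomatch = 0L]

print(err)
print(c(full = nrow(full), with_na = nrow(with_na), no_na = nrow(no_na)))
```

So neither argument would be expected to remove rows that arise from duplicated ids on both sides of the join.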
I've gone through the FAQ (http://datatable.r-forge.r-project.org/datatable-faq.pdf, especially 1.12). How would you approach this problem?
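@Arun, I had actually read the post you linked before asking the question, but didn't really grasp the concept. Probably because merge(DTtemp.versions, DTtemp.3561, by = "id", allow.cartesian = TRUE) also returns 11,227,332 observations (nomatch doesn't change this). Maybe that's why I'm confused about what allow.cartesian is or isn't doing.
@Arun, you're right: I'm not clear on a) why merge() doesn't work with these data.tables (I'll try to come up with a larger working example), and b) how to do this with a DT[lookup, ...]-style construct (since allow.cartesian = T and nomatch = 0 don't seem to drop the "excess" observations for the large dataset). But merge() does work on the original data.frames, so I'm not without recourse.
@Arun, thanks...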