2016-12-21 67 views
6

Multiply two data.tables, keeping all possible combinations

I could not find a duplicate of this question.

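My problem is as follows:

I have two data.tables. One has two columns (featurea, count) and the other has three (featureb, featurec, count; these are origin and color in the example below). I want to multiply (?) them so that I end up with a new data.table containing every possible combination. The catch is that the features do not match between the tables, so a merge-based solution may not work.

An MRE follows: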

library(data.table)

# two columns 
DT1 <- data.table(featurea =c("type1","type2"), count = c(2,3)) 

#  featurea count 
#1: type1  2 
#2: type2  3 

#three columns 
DT2 <- data.table(origin =c("house","park","park"), color =c("red","blue","red"),count =c(2,1,2)) 

# origin color count 
#1: house red  2 
#2: park blue  1 
#3: park red  2 

My expected result, in this case, is a data.table like this:

> DT3 
    origin color featurea total 
1: house red type1  4 
2: house red type2  6 
3: park blue type1  2 
4: park blue type2  3 
5: park red type1  4 
6: park red type2  6 
+1

Would `DT2[, .(featurea = DT1[["featurea"]], count = count * DT1[["count"]]), by = .(origin, color)]` be efficient enough? – Roland

+1

@Roland It seems so, and it looks like the best answer, so you should post it as one – Tensibai

Answers

6

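This would be one way. First, I expanded the rows of DT2 with expandRows() from the splitstackshape package. Because I specified count = nrow(DT1) (2 here) and count.is.col = FALSE, each row is repeated twice. Then I did the multiplication and created a new column called total. At the same time, I created a new column for featurea. Finally, I dropped count.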

library(data.table) 
library(splitstackshape) 

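# Repeat every DT2 row nrow(DT1) times, then pair each copy with one DT1 row.
# rep() spells out the recycling, which recent data.table versions no longer do implicitly in `:=`.
expandRows(DT2, count = nrow(DT1), count.is.col = FALSE)[, 
    `:=` (total = count * DT1[, count], featurea = rep(DT1[, featurea], nrow(DT2)))][, count := NULL] 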

Edit

If you would rather not add another package, you can try the idea David suggested in his comment.

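# each = keeps the copies of a DT2 row adjacent, so they line up with DT1's rows in order
DT2[rep(1:.N, each = nrow(DT1))][, 
    `:=`(total = count * DT1$count, featurea = rep(DT1$featurea, nrow(DT2)), count = NULL)][] 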



# origin color total featurea 
#1: house red  4 type1 
#2: house red  6 type2 
#3: park blue  2 type1 
#4: park blue  3 type2 
#5: park red  4 type1 
#6: park red  6 type2 
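
As a rough illustration of why the row expansion works, the sketch below builds the index vector by hand. It assumes the DT1 and DT2 from the question; the names idx and res are hypothetical and used only here. The DT2 row indices are repeated with each = so that copies of a row sit next to each other, while the DT1 values are repeated with times = so that they cycle through type1, type2 within each pair.

```r
library(data.table)

DT1 <- data.table(featurea = c("type1", "type2"), count = c(2, 3))
DT2 <- data.table(origin = c("house", "park", "park"),
                  color  = c("red", "blue", "red"),
                  count  = c(2, 1, 2))

# Row indices of DT2, each repeated once per DT1 row: 1 1 2 2 3 3
idx <- rep(seq_len(nrow(DT2)), each = nrow(DT1))

res <- DT2[idx]

# DT1's columns cycled once per DT2 row: type1 type2 type1 type2 ...
res[, `:=`(featurea = rep(DT1$featurea, times = nrow(DT2)),
           total    = count * rep(DT1$count, times = nrow(DT2)))]
res[, count := NULL]

print(res)
```

Mixing the two up (times = for the indices together with times = for the values) only happens to cover all pairs when the two row counts share no common factor, so it is safer to keep the each/times pairing explicit.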
+0

@DavidArenburg Yes, I agree with you. If the OP provides a more thorough example, this idea will need revising. `nrow(DT1)` is a good idea. – jazzurro

+0

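@jazzurro What would a more thorough example need? My dataset is much larger than this and does not have the same column names. I upvoted anyway, though – erasmortg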

+0

@erasmortg I did not mean that I need the whole dataset. Sorry for the confusion. – jazzurro

0

A solution with dplyr

library(dplyr) 
library(data.table) 

DT1 <- data.table(featurea =c("type1","type2"), count = c(2,3)) 
DT2 <- data.table(origin =c("house","park","park"), color =c("red","blue","red"),count =c(2,1,2)) 

Create a dummy column to inner join on (here it is called key):

inner_join(DT1 %>% mutate(key=1), 
      DT2 %>% mutate(key=1), by="key") %>% 
mutate(total=count.x*count.y) %>% 
select(origin, color, featurea, total) %>% 
arrange(origin, color) 
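
If a recent dplyr is available, the dummy column may not be needed. The following is a sketch, assuming dplyr 1.1.0 or later, where cross_join() is provided; the names df1, df2 and out are hypothetical, and plain data frames are used to keep the example independent of data.table.

```r
library(dplyr)

df1 <- data.frame(featurea = c("type1", "type2"), count = c(2, 3))
df2 <- data.frame(origin = c("house", "park", "park"),
                  color  = c("red", "blue", "red"),
                  count  = c(2, 1, 2))

# cross_join() pairs every row of df2 with every row of df1;
# the clashing count columns get the default suffixes .x (df2) and .y (df1)
out <- cross_join(df2, df1) %>%
  mutate(total = count.x * count.y) %>%
  select(origin, color, featurea, total) %>%
  arrange(origin, color, featurea)

print(out)
```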
8

Please test this on larger data; I do not know how well optimized it is:

DT2[, .(featurea = DT1[["featurea"]], 
     count = count * DT1[["count"]]), by = .(origin, color)] 
# origin color featurea count 
#1: house red type1  4 
#2: house red type2  6 
#3: park blue type1  2 
#4: park blue type2  3 
#5: park red type1  4 
#6: park red type2  6 

It might be more efficient to switch the roles around if DT1 has fewer groups:

DT1[, c(DT2[, .(origin, color)], 
     .(count = count * DT2[["count"]])), by = featurea] 
# featurea origin color count 
#1: type1 house red  4 
#2: type1 park blue  2 
#3: type1 park red  4 
#4: type2 house red  6 
#5: type2 park blue  3 
#6: type2 park red  6
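
One way to act on the "test on larger data" advice is a quick timing sketch like the one below. The table sizes and the names big1, big2, r1 and r2 are made up purely for illustration, and the timings will vary by machine and data.table version, so treat this as a starting point rather than a benchmark.

```r
library(data.table)

set.seed(1)
n_groups <- 1e4  # hypothetical size, chosen only for illustration

big1 <- data.table(featurea = paste0("type", 1:50), count = runif(50))
big2 <- data.table(origin = paste0("o", seq_len(n_groups)),
                   color  = paste0("c", seq_len(n_groups)),
                   count  = runif(n_groups))

# Grouping by big2's features: many groups, little work per group
t_by_dt2 <- system.time(
  r1 <- big2[, .(featurea = big1[["featurea"]],
                 count = count * big1[["count"]]), by = .(origin, color)]
)

# Grouping by big1's feature: few groups, more work per group
t_by_dt1 <- system.time(
  r2 <- big1[, c(big2[, .(origin, color)],
                 .(count = count * big2[["count"]])), by = featurea]
)

# Elapsed seconds for each orientation
print(c(by_DT2 = t_by_dt2[["elapsed"]], by_DT1 = t_by_dt1[["elapsed"]]))
```

Both orientations should produce the same set of rows in a different order, so comparing nrow() and the sum of count is a cheap consistency check before trusting either timing.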