Multiple matching across different rows and columns for big data

I have a multiple-matching problem between two different tables (1 million rows × 15 columns; 3000 rows × 20 columns), and the first table may grow larger (10 million rows).

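My solution works, but I would like it to be as fast as possible, since I may need to run the script on larger data frames. I am using the R package data.table.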

Consider two example tables, from which no rows can be deleted:

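Table 1 - a value of FALSE in the ToMatch column means that the associated tag is not present in table 2; this step reduces the number of matches to perform by two orders of magnitude: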

library(data.table)
set.seed(99)
table1 <- data.table(Tag = sample(paste0("tag_", 1:3), 5, replace = TRUE))
table1[, ToMatch := Tag != "tag_1"]

> table1
     Tag ToMatch
1: tag_2    TRUE
2: tag_1   FALSE
3: tag_3    TRUE
4: tag_3    TRUE
5: tag_2    TRUE

Table 2:

set.seed(99)
table2 <- data.table(center = sample(paste0("tag_", 2:8), 5, replace = TRUE),
                     north  = sample(paste0("tag_", 2:8), 5, replace = TRUE),
                     south  = sample(paste0("tag_", 2:8), 5, replace = TRUE))

> table2
   center north south
1:  tag_6 tag_8 tag_5
2:  tag_2 tag_6 tag_5
3:  tag_6 tag_4 tag_3
4:  tag_8 tag_4 tag_6
5:  tag_5 tag_3 tag_6

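My goal is to find the rows of table 2 in which each tag of table 1 occurs (it may be in any of the columns). I would like the output as a list column:

Output: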

     Tag ToMatch output
1: tag_2    TRUE      2
2: tag_1   FALSE     NA
3: tag_3    TRUE    3,5
4: tag_3    TRUE    3,5
5: tag_2    TRUE      2

My solution:

Which rows of table 1 are to be evaluated:

match.index <- which(table1$ToMatch == T) 
> match.index 
[1] 1 3 4 5

Pool all the tags from table 2, preserving the row order, by using t() (tag_6 tag_8 tag_5 tag_2 tag_6 tag_5 ...):

all.tags <- as.vector(t(table2)) 
> all.tags 
[1] "tag_6" "tag_8" "tag_5" "tag_2" "tag_6" "tag_5" "tag_6" 
[8] "tag_4" "tag_3" "tag_8" "tag_4" "tag_6" "tag_5" "tag_3" 
[15] "tag_6"

Predefine an empty list:

list.results <- as.list(rep(as.numeric(NA), dim(table1)[1]))

Loop:

for (i in seq_along(match.index)) {
    list.results[[match.index[i]]] <- ceiling(
        grep(table1[match.index[i], Tag], all.tags) / 3)
}

# Dividing each index of all.tags found by grep by 3 (the number of
# columns in table2) and rounding up to the nearest integer (ceiling)
# returns the row of the original table2 where the tag is located.
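
The index arithmetic described in the comment above can also be applied once to the whole flattened vector rather than once per tag. The following is a minimal sketch, not the asker's code: it rebuilds the two example tables by hand from the printed output (as plain data frames, so it does not depend on the random seed or on data.table), and `row.of` and `hits` are illustrative names. It uses exact string equality, because `grep` treats the tag as a pattern and would also match "tag_1" inside "tag_10".

```r
# Example tables rebuilt by hand from the printed output (assumption:
# plain data frames are used here only to keep the sketch self-contained)
table1 <- data.frame(Tag = c("tag_2", "tag_1", "tag_3", "tag_3", "tag_2"),
                     stringsAsFactors = FALSE)
table1$ToMatch <- table1$Tag != "tag_1"
table2 <- data.frame(
    center = c("tag_6", "tag_2", "tag_6", "tag_8", "tag_5"),
    north  = c("tag_8", "tag_6", "tag_4", "tag_4", "tag_3"),
    south  = c("tag_5", "tag_5", "tag_3", "tag_6", "tag_6"),
    stringsAsFactors = FALSE)

# grep() does pattern matching, so a tag can match inside a longer tag
partial <- grep("tag_1", c("tag_1", "tag_10"))  # both elements match

# Flatten table2 row by row, then map every position back to its row
all.tags <- as.vector(t(table2))
row.of   <- ceiling(seq_along(all.tags) / ncol(table2))

# One pass over table2: group the row numbers by tag (exact equality)
hits <- split(row.of, all.tags)

# Look up each tag of table1; skipped or unmatched tags become NA
list.results <- unname(hits[table1$Tag])
list.results[!table1$ToMatch | lengths(list.results) == 0L] <- list(NA_real_)
```

The design choice is to scan table2 once and then do a cheap lookup by name per row of table1, instead of scanning `all.tags` once per tag.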

Final output:

> table1[, output := list.results]
> table1
     Tag ToMatch output
1: tag_2    TRUE      2
2: tag_1   FALSE     NA
3: tag_3    TRUE    3,5
4: tag_3    TRUE    3,5
5: tag_2    TRUE      2

Do you have any suggestions for speeding up this code?

Thank you in advance.

2016-11-08 Gerald T

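Any good reason for the duplicate rows in table1? If you are interested in speed, that sort of data-structure decision could be a speed bump. Similarly, the wide-format storage of table2 would be better off as `melt(table2[, r := .I], "r", value.name = "Tag")`... – Frank

@Frank, the table above is a snapshot of a larger table with more fields. In reality, no rows would be duplicated. –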

The difficulty is mainly in the wide representation of table2. Once that is melted, the rest is simple:

melt(table2[, id := .I], id = 'id')[ 
    table1, on = c(value = 'Tag'), .(list(if(ToMatch) id)), by = .EACHI] 
# value V1 
#1: tag_2 2 
#2: tag_1 NULL 
#3: tag_3 5,3 
#4: tag_3 5,3 
#5: tag_2 2

If you have a lot of duplicates, unique the data beforehand:

melt(table2[, id := .I], id = 'id')[ 
    unique(table1), on = c(value = 'Tag'), .(list(if(ToMatch) id)), by = .EACHI][ 
    table1, on = c(value = 'Tag')]

2016-11-08 19:35:49 eddi

OK, I studied your code and I think I understand almost all of it. Thanks, it is very nice. Could you clarify what the `.(list(if(ToMatch) id))` part does? –

@GeraldT It creates a `list` column in which, if `ToMatch` is `TRUE`, the list entry is the matching ids, and otherwise the entry is empty (`NULL`) – eddi
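
To make that comment concrete, here is a small base R sketch of the two pieces involved: an `if` without an `else` evaluates to `NULL` when its condition is `FALSE`, and wrapping the result in `list()` produces a single list entry either way. The names `ids`, `kept` and `dropped` are hypothetical and only serve the illustration.

```r
# Hypothetical matching ids for one tag
ids <- c(5L, 3L)

# Condition TRUE: the entry holds the ids
kept <- list(if (TRUE) ids)

# Condition FALSE: `if` without `else` yields NULL, so the entry is empty
dropped <- list(if (FALSE) ids)

str(kept)     # a list of one integer vector
str(dropped)  # a list of one NULL
```

In the join, this expression is evaluated once per row of table1 (because of `by = .EACHI`), so each row gets exactly one list entry.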

@eddi Do you know why `on = 'Tag'` did not work? From `example(data.table)` it looks like it should. Thanks a lot, I feel I have learned a great deal from your answer. –
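
One likely explanation, offered as an assumption rather than as eddi's reply: `on = 'Tag'` needs a column called `Tag` in both tables, but `melt` names the melted column `value` by default, which is why the answer writes `on = c(value = 'Tag')`. A minimal sketch, with the example tables rebuilt by hand from the printed output and with `long` and `res` as illustrative names, passes `value.name = "Tag"` (as in Frank's comment) so that the short form can be used:

```r
library(data.table)

# Example tables rebuilt by hand from the printed output
table1 <- data.table(Tag = c("tag_2", "tag_1", "tag_3", "tag_3", "tag_2"))
table1[, ToMatch := Tag != "tag_1"]
table2 <- data.table(
    center = c("tag_6", "tag_2", "tag_6", "tag_8", "tag_5"),
    north  = c("tag_8", "tag_6", "tag_4", "tag_4", "tag_3"),
    south  = c("tag_5", "tag_5", "tag_3", "tag_6", "tag_6"))

# Long format: one row per (row id, column, tag); the tag column is
# named "Tag" instead of the default "value"
long <- melt(table2[, id := .I], id.vars = "id", value.name = "Tag")

# Both tables now share the column name, so on = "Tag" is enough
res <- long[table1, on = "Tag",
            .(output = list(if (ToMatch) id)), by = .EACHI]
```

Note that `table2[, id := .I]` adds the `id` column to `table2` by reference, exactly as in the answer.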

Here is a bit of base R code that will do the trick:

table1 <- within(table1, { 
       output <- NA 
       output[ToMatch] <- sapply(Tag[ToMatch], function(x) 
            paste(which(x == table2, arr.ind=TRUE)[,1], collapse=",")) 
})

table1

     Tag ToMatch output
1: tag_2    TRUE      2
2: tag_1   FALSE     NA
3: tag_3    TRUE    5,3
4: tag_3    TRUE    5,3
5: tag_2    TRUE      2

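Here is a brief description. within lets you refer to names inside an object (usually a data frame) and saves a bit of typing. First, output is assigned NA. Then, for each element of output that is to be matched (selected with ToMatch), which with the arr.ind = TRUE argument finds the rows of table2 whose elements match that tag. paste puts the results for each element together, collapsed with ",".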
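
As an illustration of the `which(..., arr.ind = TRUE)` step in isolation, here is a minimal sketch on a character matrix built by hand from the printed table 2. The names `m`, `hits` and `rows` are hypothetical.

```r
# table 2 as a plain character matrix, copied from the printed output
m <- cbind(
    center = c("tag_6", "tag_2", "tag_6", "tag_8", "tag_5"),
    north  = c("tag_8", "tag_6", "tag_4", "tag_4", "tag_3"),
    south  = c("tag_5", "tag_5", "tag_3", "tag_6", "tag_6"))

# arr.ind = TRUE returns a two-column matrix of (row, col) positions,
# listed column by column
hits <- which(m == "tag_3", arr.ind = TRUE)

# Keep only the row numbers and collapse them into one string
rows   <- unname(hits[, "row"])
joined <- paste(rows, collapse = ",")
```

Because the matches are listed column by column, the hit in `north` (row 5) comes before the hit in `south` (row 3), which is why this answer shows "5,3" where the question shows "3,5".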

A data.table analogue of the code above is

table1[, output := NA_character_][as.logical(ToMatch),
    output := sapply(Tag, function(x)
        paste(which(x == table2, arr.ind=TRUE)[,1], collapse=","))][]
     Tag ToMatch output
1: tag_2    TRUE      2
2: tag_1   FALSE     NA
3: tag_3    TRUE    5,3
4: tag_3    TRUE    5,3
5: tag_2    TRUE      2

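The first [] creates the NA vector, and the second subsets to the elements of interest and fills in the NA vector with the desired values. The "fill-in" part of the code is the same as in the code above.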

2016-11-08 18:58:12 lmo

Thanks for your answer, it works. But your solution produces a character column rather than a list column :) –
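
If a list column is needed, one option is to split the collapsed strings back into integer vectors. This is a minimal base R sketch, not part of lmo's answer; `output_chr` is a hypothetical stand-in for the character column shown above, and `output_list` is an illustrative name.

```r
# Stand-in for the character column produced by the paste() approach
output_chr <- c("2", NA, "5,3", "5,3", "2")

# Split on commas and convert each piece to integer;
# a missing string stays a single NA
output_list <- lapply(strsplit(output_chr, ",", fixed = TRUE), as.integer)
```

The resulting list has one entry per row and could then be assigned as a list column in the same way the question does with `output := list.results`. Skipping `paste` and keeping the row indices from `which` directly would avoid the round trip through strings.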
