R: transform a data.table
My original data, mydf, looks like this (no duplicate rows):
group hed_pfnpi id
1: aa 111111 18
2: aa 111111 17
3: aa 222222 18
4: aa 333333 14
5: aa 444444 13
6: aa 555555 18
7: aa 555555 24
8: aa 222222 13
9: aa 222222 17
10: aa 333333 17
11: bb 666666 9
12: bb 666666 3
13: bb 888888 9
14: bb 999999 14
15: bb 444444 13
16: bb 555555 9
17: bb 555555 24
18: bb 888888 13
19: bb 888888 3
20: bb 999999 3
I want to transform mydf into the following result table:
group one two weight id_list
1 aa 111111 222222 2 17,18
2 aa 111111 333333 1 17
3 aa 111111 555555 1 18
4 aa 222222 333333 1 17
5 aa 222222 444444 1 13
6 aa 222222 555555 1 18
7 bb 444444 888888 1 13
8 bb 555555 666666 1 9
9 bb 555555 888888 1 9
10 bb 666666 888888 2 3,9
11 bb 666666 999999 1 3
12 bb 888888 999999 1 3
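First, split the data by the group column. Then, within each group:
if two hed_pfnpi values share the same id, they form a pair (one, two) in the result table;
id_list: the ids shared by that pair;
weight: the length of id_list.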
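To make the rules concrete, here is a minimal sketch on five rows taken from group aa above. The names `toy`, `pairs`, and `result` are only illustrative, and enumerating pairs with `combn` is just one possible way to express the rule; it assumes, as stated above, that the data has no duplicate rows.

```r
library(data.table)

# Five rows from group "aa" of the data shown above
toy <- data.table(
  group     = c("aa", "aa", "aa", "aa", "aa"),
  hed_pfnpi = c(111111L, 111111L, 222222L, 222222L, 333333L),
  id        = c(18L, 17L, 18L, 17L, 17L)
)

# Step 1: within each (group, id), list every unordered pair of hed_pfnpi.
# Groups with a single hed_pfnpi return NULL and therefore contribute no rows.
pairs <- toy[, if (.N > 1L) {
  m <- combn(sort(hed_pfnpi), 2L)   # 2 x k matrix, smaller value in row 1
  list(one = m[1L, ], two = m[2L, ])
}, by = .(group, id)]

# Step 2: collapse by pair; weight is the number of shared ids
result <- pairs[, .(weight  = .N,
                    id_list = paste(sort(id), collapse = ",")),
                keyby = .(group, one, two)]

print(result)
```

On these five rows the pair 111111/222222 shares ids 17 and 18, so it gets weight 2, while 111111/333333 and 222222/333333 each share only id 17.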
library(data.table)
library(dplyr)
library(magrittr)
library(tidyverse)
mydf1 <- data.table(structure(list(group = rep("aa",10),hed_pfnpi = c(111111L, 111111L, 222222L, 333333L, 444444L,
555555L, 555555L, 222222L, 222222L, 333333L), id = c(18L, 17L,
18L, 14L, 13L, 18L, 24L, 13L, 17L, 17L)), .Names = c("group","hed_pfnpi", "id"), class = "data.frame", row.names = c(NA, -10L)))
mydf2 <- data.table(structure(list(group = rep("bb",10),hed_pfnpi = c(666666L, 666666L, 888888L, 999999L, 444444L,
555555L, 555555L, 888888L, 888888L, 999999L), id = c(9L, 3L,
9L, 14L, 13L, 9L, 24L, 13L, 3L, 3L)), .Names = c("group","hed_pfnpi", "id"), class = "data.frame", row.names = c(NA, -10L)))
mydf <- rbind(mydf1,mydf2)
# My attempt: self-join on id, keep same-group pairs, then aggregate per pair
result <- merge(mydf, mydf, by = "id", allow.cartesian = TRUE) %>%
  filter(group.x == group.y) %>%
  # order each pair so that (a, b) and (b, a) collapse to the same row
  transmute(group = group.x,
            one = pmin(hed_pfnpi.x, hed_pfnpi.y),
            two = pmax(hed_pfnpi.x, hed_pfnpi.y),
            id) %>%
  filter(one != two) %>%   # drop self-pairs
  unique() %>%
  group_by(group, one, two) %>%
  summarise(id_list = paste(id, collapse = ","),
            weight = n()) %>%
  select(group, one, two, weight, id_list)
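The code above is my attempt. It produces the expected result, but it is not efficient (it crashes when the data is large). I hope someone can suggest a better solution.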