2017-09-13

R: transforming a data.table

My raw data, mydf, has the following structure (there are no duplicate rows):

group hed_pfnpi id 
1: aa 111111 18 
2: aa 111111 17 
3: aa 222222 18 
4: aa 333333 14 
5: aa 444444 13 
6: aa 555555 18 
7: aa 555555 24 
8: aa 222222 13 
9: aa 222222 17 
10: aa 333333 17 
11: bb 666666 9 
12: bb 666666 3 
13: bb 888888 9 
14: bb 999999 14 
15: bb 444444 13 
16: bb 555555 9 
17: bb 555555 24 
18: bb 888888 13 
19: bb 888888 3 
20: bb 999999 3 

I want to transform mydf into the following result table:

group one two weight id_list 
1 aa 111111 222222  2 17,18 
2 aa 111111 333333  1  17 
3 aa 111111 555555  1  18 
4 aa 222222 333333  1  17 
5 aa 222222 444444  1  13 
6 aa 222222 555555  1  18 
7 bb 444444 888888  1  13 
8 bb 555555 666666  1  9 
9 bb 555555 888888  1  9 
10 bb 666666 888888  2  3,9 
11 bb 666666 999999  1  3 
12 bb 888888 999999  1  3 

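First, split the data by the group column. Then, within each group:

If two hed_pfnpi values share the same id, they form a pair (one, two) in the result table;

id_list: the ids shared by that pair;

weight: the length of id_list.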
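As a small illustration of the pairing rule, here is a sketch in base R for a single id. In the sample data, id 18 in group aa is shared by 111111, 222222 and 555555; the variable names `shared` and `pairs` are made up for this example and are not part of the original code.

```r
# hed_pfnpi values that share id 18 within group "aa" (copied from the sample data)
shared <- c(111111L, 222222L, 555555L)

# combn() returns one combination per column; sorting first puts the smaller
# value in the first row, and t() turns each pair into a row
pairs <- t(combn(sort(shared), 2))
colnames(pairs) <- c("one", "two")

pairs
```

Each of the three rows contributes id 18 to the id_list of the matching (one, two) pair in the result table.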

library(data.table) 
library(dplyr)  # provides %>%, filter(), transmute(), group_by(), summarise(), select() 


mydf1 <- data.table(
  group = rep("aa", 10),
  hed_pfnpi = c(111111L, 111111L, 222222L, 333333L, 444444L,
                555555L, 555555L, 222222L, 222222L, 333333L),
  id = c(18L, 17L, 18L, 14L, 13L, 18L, 24L, 13L, 17L, 17L)
)
mydf2 <- data.table(
  group = rep("bb", 10),
  hed_pfnpi = c(666666L, 666666L, 888888L, 999999L, 444444L,
                555555L, 555555L, 888888L, 888888L, 999999L),
  id = c(9L, 3L, 9L, 14L, 13L, 9L, 24L, 13L, 3L, 3L)
)
mydf <- rbind(mydf1,mydf2) 


# my attempt 
result <- merge(mydf, mydf, by = "id", allow.cartesian=TRUE) %>% 
    filter(group.x == group.y) %>% 
    transmute(group = group.x, 
      one = pmin(hed_pfnpi.x, hed_pfnpi.y), 
      two = pmax(hed_pfnpi.x, hed_pfnpi.y), 
      id) %>% 
    filter(one != two) %>% 
    unique() %>% 
    group_by(group,one, two) %>% 
    summarise(id_list = paste(id, collapse = ","), 
      weight = n()) %>% 
    select(group,one, two,weight, id_list) 

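The code above is my attempt. It produces the expected result, but it is not efficient (it crashes when the data is large). I hope someone can offer a better solution.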
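One likely reason the attempt grows so quickly is that `merge(mydf, mydf, by = "id")` joins on id alone, so every id that occurs N times contributes N^2 rows before the group filter is applied. The following sketch counts those rows on the sample data; `per_id` and `join_rows` are names made up for this illustration.

```r
library(data.table)

# Sample data from the question
mydf <- data.table(
  group = rep(c("aa", "bb"), each = 10),
  hed_pfnpi = c(111111L, 111111L, 222222L, 333333L, 444444L,
                555555L, 555555L, 222222L, 222222L, 333333L,
                666666L, 666666L, 888888L, 999999L, 444444L,
                555555L, 555555L, 888888L, 888888L, 999999L),
  id = c(18L, 17L, 18L, 14L, 13L, 18L, 24L, 13L, 17L, 17L,
         9L, 3L, 9L, 14L, 13L, 9L, 24L, 13L, 3L, 3L)
)

# Occurrences of each id, ignoring group, because the merge keys on id only
per_id <- mydf[, .N, by = id]

# Rows the self-merge has to materialise: sum of N^2 over all ids
join_rows <- per_id[, sum(N^2)]

# Compare the prediction with the size of the actual self-merge
merged_rows <- nrow(merge(mydf, mydf, by = "id", allow.cartesian = TRUE))
c(input = nrow(mydf), predicted = join_rows, merged = merged_rows)
```

On the 20 sample rows the intermediate table already has 60 rows; with ids that repeat thousands of times the same quadratic growth could exhaust memory.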

Answer

I would do it like this (loading only data.table, none of the other packages):

mydf[, 
    CJ(one = hed_pfnpi, two = hed_pfnpi)[one < two] 
, keyby=.(group, id)][, 
    .(n = .N, ids = toString(id)) 
, keyby=.(group, one, two)] 

which gives

group one two n ids 
1: aa 111111 222222 2 17, 18 
2: aa 111111 333333 1  17 
3: aa 111111 555555 1  18 
4: aa 222222 333333 1  17 
5: aa 222222 444444 1  13 
6: aa 222222 555555 1  18 
7: bb 444444 888888 1  13 
8: bb 555555 666666 1  9 
9: bb 555555 888888 1  9 
10: bb 666666 888888 2 3, 9 
11: bb 666666 999999 1  3 
12: bb 888888 999999 1  3
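To make the two chained steps easier to follow, the sketch below runs only the first one and inspects its intermediate result. The name `step1` is introduced here for illustration; the rest mirrors the answer's code on the question's sample data.

```r
library(data.table)

# Sample data from the question
mydf <- data.table(
  group = rep(c("aa", "bb"), each = 10),
  hed_pfnpi = c(111111L, 111111L, 222222L, 333333L, 444444L,
                555555L, 555555L, 222222L, 222222L, 333333L,
                666666L, 666666L, 888888L, 999999L, 444444L,
                555555L, 555555L, 888888L, 888888L, 999999L),
  id = c(18L, 17L, 18L, 14L, 13L, 18L, 24L, 13L, 17L, 17L,
         9L, 3L, 9L, 14L, 13L, 9L, 24L, 13L, 3L, 3L)
)

# Step 1 only: within each (group, id), cross join hed_pfnpi with itself
# and keep each unordered pair once by requiring one < two
step1 <- mydf[,
    CJ(one = hed_pfnpi, two = hed_pfnpi)[one < two]
, keyby = .(group, id)]

# Pairs generated by id 17 in group "aa": three rows
step1[group == "aa" & id == 17]

# One row per (group, id, pair): 14 rows in total for the sample data
nrow(step1)
```

The second step then only counts and concatenates ids per (group, one, two). To match the column names and separator used in the question, its `j` expression could be written as `.(weight = .N, id_list = paste(id, collapse = ","))`.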