2015-09-03

Collect IDs that are connected across different rows of a data frame

Given an R data frame like this:

DF.a <- data.frame(ID1 = c("A","B","C","D","E","F","G","H"), 
        ID2 = c("D",NA,"G",NA,NA,NA,"H",NA), 
        ID3 = c("F",NA,NA,NA,NA,NA,NA,NA)) 

> DF.a 
  ID1  ID2  ID3
1   A    D    F
2   B <NA> <NA>
3   C    G <NA>
4   D <NA> <NA>
5   E <NA> <NA>
6   F <NA> <NA>
7   G    H <NA>
8   H <NA> <NA>

I would like to simplify/reshape it into the following:

DF.b <- data.frame(ID1 = c("A","B","C","E"), 
        ID2 = c("D",NA,"G",NA), 
        ID3 = c("F",NA,"H",NA)) 

> DF.b 
  ID1  ID2  ID3
1   A    D    F
2   B <NA> <NA>
3   C    G    H
4   E <NA> <NA>

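This does not look like a simple reshape. The goal is to gather all "connected" ID values into a single row. Note that the connection between "C" and "H" is indirect: both are connected to "G", but they never appear together in the same row of DF.a. The order of the ID values within a row of DF.b does not matter.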

Answer

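Really, you can think of this as finding all the connected components of a graph. As a first step, I would convert your data into a more natural structure: a vector of nodes and a matrix of edges: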

(nodes <- as.character(sort(unique(unlist(DF.a))))) 
# [1] "A" "B" "C" "D" "E" "F" "G" "H" 
(edges <- do.call(rbind, apply(DF.a, 1, function(x) { 
    x <- x[!is.na(x)] 
    cbind(head(x, -1), tail(x, -1)) 
}))) 
#     [,1] [,2]
# ID1 "A"  "D"
# ID2 "D"  "F"
# ID1 "C"  "G"
# ID1 "G"  "H"
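
To see why `head(x, -1)` and `tail(x, -1)` produce the edges, it may help to run the pairing on a single row. The sketch below wraps the same logic in a helper called `row_to_edges`; that name is made up for illustration and is not part of the answer. It links each ID in a row to the next one, which is enough to place all of them in the same component.

```r
# Hypothetical helper (the name is illustrative): turn one row of IDs into
# a two-column matrix of edges linking each ID to the next one.
row_to_edges <- function(x) {
  x <- x[!is.na(x)]                  # drop the NA padding
  cbind(head(x, -1), tail(x, -1))    # pair element i with element i + 1
}

row1 <- c(ID1 = "A", ID2 = "D", ID3 = "F")
e1 <- row_to_edges(row1)             # two edges: A-D and D-F

row2 <- c(ID1 = "B", ID2 = NA, ID3 = NA)
e2 <- row_to_edges(row2)             # zero rows: a lone ID has no edges
```

Because a lone ID such as "B" contributes no edges, the separate `nodes` vector is what keeps it in the graph later on.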

Now you can build a graph and compute its components:

library(igraph) 
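# Undirected graph; passing nodes keeps IDs with no edges (e.g. "B") as vertices
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)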
(comp <- split(nodes, components(g)$membership)) 
# $`1` 
# [1] "A" "D" "F" 
# 
# $`2` 
# [1] "B" 
# 
# $`3` 
# [1] "C" "G" "H" 
# 
# $`4` 
# [1] "E" 
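
If igraph is not available, the same grouping can be sketched in base R. This is not how `components()` works internally; it is a simple label-propagation loop, written on the assumption that `nodes` and `edges` have the shapes shown above. The function name `label_components` and the variable `comp_base` are made up for illustration.

```r
# Illustrative base-R alternative (not igraph's implementation):
# repeatedly give both endpoints of every edge the smaller of their two
# labels until nothing changes; nodes sharing a label form one component.
label_components <- function(nodes, edges) {
  label <- setNames(seq_along(nodes), nodes)   # every node starts alone
  repeat {
    changed <- FALSE
    for (i in seq_len(nrow(edges))) {
      a <- edges[i, 1]
      b <- edges[i, 2]
      m <- min(label[[a]], label[[b]])
      if (label[[a]] != m || label[[b]] != m) {
        label[[a]] <- m
        label[[b]] <- m
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  unname(split(nodes, label))
}

# Same nodes and edges as above, re-created so the snippet runs on its own.
nodes <- c("A", "B", "C", "D", "E", "F", "G", "H")
edges <- rbind(c("A", "D"), c("D", "F"), c("C", "G"), c("G", "H"))
comp_base <- label_components(nodes, edges)
```

On this input the loop settles after two passes. For large graphs igraph is likely the better choice, since this sketch rescans every edge on each pass.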

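The output of the split function is a list in which each element holds all the nodes of one component of the graph. Personally, I think this is the most useful representation of the output, but if you really need the NA-padded structure you describe, you could try something like this: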

max.len <- max(sapply(comp, length)) 
do.call(rbind, lapply(comp, function(x) { length(x) <- max.len ; x })) 
#   [,1] [,2] [,3]
# 1 "A"  "D"  "F"
# 2 "B"  NA   NA
# 3 "C"  "G"  "H"
# 4 "E"  NA   NA
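
If a data frame with the original column names is wanted rather than a character matrix, one possible final step is sketched below. The `comp` list is re-created by hand so the snippet runs on its own, and the names `padded` and `DF.out` are assumptions for illustration, not something from the question.

```r
# Components as produced above, re-created by hand so this runs standalone.
comp <- list(`1` = c("A", "D", "F"), `2` = "B",
             `3` = c("C", "G", "H"), `4` = "E")

max.len <- max(lengths(comp))
padded <- do.call(rbind, lapply(comp, function(x) {
  length(x) <- max.len               # extending a vector pads it with NA
  x
}))

# DF.out is an illustrative name; columns follow the ID1, ID2, ... pattern.
DF.out <- as.data.frame(padded, stringsAsFactors = FALSE)
names(DF.out) <- paste0("ID", seq_len(max.len))
rownames(DF.out) <- NULL
```

The number of columns follows the largest component, so a bigger connected group would simply add ID4, ID5 and so on.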