2016-06-24

Efficiently collapsing a matrix

I have a matrix in this format:

set.seed(1) 
mat <- matrix(round(runif(25,0,1)),nrow=5,ncol=5) 
colnames(mat) <- c("a1::C","a1::A","a1::B","b1::D","b1::A") 

     a1::C a1::A a1::B b1::D b1::A
[1,]     0     1     0     0     1
[2,]     0     1     0     1     0
[3,]     1     1     1     1     1
[4,]     1     1     0     0     0
[5,]     0     0     1     1     0

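That is, each column is a subject and a feature, which are separated by :: in the column name. In each row, a value of 1 means the subject has that feature and 0 means it does not. A subject may have 0 in all of its columns for a given row.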
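To make the subject::feature naming convention concrete, here is a minimal sketch in base R that splits the column names into their two parts. The object name `parts` and its column labels are chosen here for illustration and are not part of the question.

```r
# Column names from the example matrix.
cn <- c("a1::C", "a1::A", "a1::B", "b1::D", "b1::A")

# Split each name on the literal "::" and stack the pieces row-wise.
parts <- do.call(rbind, strsplit(cn, "::", fixed = TRUE))
colnames(parts) <- c("subject", "feature")

# Features available for each subject, in column order.
by_subject <- split(parts[, "feature"], parts[, "subject"])
# by_subject$a1 is c("C", "A", "B"); by_subject$b1 is c("D", "A")
```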

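I want to construct a new matrix with one column per subject, where each row holds the features that subject has, sorted alphabetically and separated by commas. If a subject has no features in a row (i.e., all of that subject's columns are 0 in that row), the value "W" should be used (no feature is named "W").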

Here is what the new matrix based on mat would look like:

cnames = unique(sapply(colnames(mat), function(x) strsplit(x,split="::")[[1]][1])) 
new_mat <- matrix(c("A","A","A,B,C","A,C","B", 
        "A","D","A,D","W","D"), 
        nrow=nrow(mat),ncol=length(cnames)) 
colnames(new_mat) = cnames 

     a1      b1
[1,] "A"     "A"
[2,] "A"     "D"
[3,] "A,B,C" "A,D"
[4,] "A,C"   "W"
[5,] "B"     "D"

Any ideas for an efficient and elegant way to achieve this?

Answers

2

Here is a starting point. However, depending on how many variables you have, this could become cumbersome.

library(data.table) 
dt = data.table(id = seq_len(nrow(mat)), mat) 
longDt <- melt(dt, id.vars = "id", measure = patterns("^a1::", "^b1::")) 

longDt[, .(a1 = list(sort(c("C", "A", "B")[as.logical(value1)])), 
      b1 = list(sort(c("D", "A")[as.logical(value2)]))), .(id)] 
   id    a1  b1
1:  1     A   A
2:  2     A   D
3:  3 A,B,C A,D
4:  4   A,C
5:  5     B   D
4

Step 1: matrix column pivoting (sorting the columns by name)

mat <- mat[, order(colnames(mat))] 

#      a1::A a1::B a1::C b1::A b1::D
# [1,]     1     0     0     1     0
# [2,]     1     0     0     0     1
# [3,]     1     1     1     1     1
# [4,]     1     0     1     0     0
# [5,]     0     1     0     0     1

Step 2.1: column name decomposition

## decompose levels, get main levels (before ::) and sub levels (post ::) 
decom <- strsplit(colnames(mat), "::") 

main_levels <- sapply(decom, "[", 1) 
# [1] "a1" "a1" "a1" "b1" "b1" 

sub_levels <- sapply(decom, "[", 2) 
# [1] "A" "B" "C" "A" "D" 

Step 2.2: grouping index generation

## generating grouping index 
main_index <- paste(rep(main_levels, each = nrow(mat)), rep(1:nrow(mat), times = ncol(mat)), sep = "#") 
sub_index <- rep(sub_levels, each = nrow(mat)) 
sub_index[!as.logical(mat)] <- "" ## 0 values in mat implies "" 

## if unclear what "main_index" and "sub_index" are, check:

## matrix(main_index, nrow(mat)) 
#  [,1] [,2] [,3] [,4] [,5] 
# [1,] "a1#1" "a1#1" "a1#1" "b1#1" "b1#1" 
# [2,] "a1#2" "a1#2" "a1#2" "b1#2" "b1#2" 
# [3,] "a1#3" "a1#3" "a1#3" "b1#3" "b1#3" 
# [4,] "a1#4" "a1#4" "a1#4" "b1#4" "b1#4" 
# [5,] "a1#5" "a1#5" "a1#5" "b1#5" "b1#5" 

## matrix(sub_index, nrow(mat)) 
#  [,1] [,2] [,3] [,4] [,5] 
# [1,] "A" "" "" "A" "" 
# [2,] "A" "" "" "" "D" 
# [3,] "A" "B" "C" "A" "D" 
# [4,] "A" "" "C" "" "" 
# [5,] "" "B" "" "" "D" 

Step 2.3: conditional collapsed paste

## collapsed paste of "sub_index" conditional on "main_index" 
x <- unname(tapply(sub_index, main_index, paste0, collapse = "")) 
x[x == ""] <- "W" 
# [1] "A" "A" "ABC" "AC" "B" "A" "D" "AD" "W" "D" 
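One caveat with the pasted "main#row" key: tapply() returns groups in sorted order of the key, and as strings "a1#10" sorts ahead of "a1#2". With ten or more rows, the result would therefore no longer be in row order when Step 4 reshapes it into a matrix. A possible workaround, sketched below, passes two separate grouping factors to tapply() instead of one pasted string. The names `row_id` and `main_id` are introduced here for illustration, and the setup lines only rebuild the example so the block runs on its own.

```r
# Rebuild the example and the objects from Steps 1 to 2.2.
set.seed(1)
mat <- matrix(round(runif(25, 0, 1)), nrow = 5, ncol = 5)
colnames(mat) <- c("a1::C", "a1::A", "a1::B", "b1::D", "b1::A")
mat <- mat[, order(colnames(mat))]

decom       <- strsplit(colnames(mat), "::", fixed = TRUE)
main_levels <- sapply(decom, "[", 1)
sub_index   <- rep(sapply(decom, "[", 2), each = nrow(mat))
sub_index[!as.logical(mat)] <- ""

# Two grouping factors: integer row ids sort numerically (1, 2, ..., 10, ...),
# and main levels keep their order of first appearance.
row_id  <- rep(seq_len(nrow(mat)), times = ncol(mat))
main_id <- factor(rep(main_levels, each = nrow(mat)),
                  levels = unique(main_levels))

# tapply() with two factors returns a rows-by-main-levels matrix directly.
x <- tapply(sub_index, list(row_id, main_id), paste0, collapse = "")
x[x == ""] <- "W"
rownames(x) <- NULL
```

With this variant the result is already a matrix with one column per main level, so only the comma insertion of Step 3 remains to be applied to it.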

Step 3: post-processing

I am not very happy with this step, but I have not found an alternative.

x <- sapply(strsplit(x, ""), paste0, collapse = ",") 
# [1] "A" "A" "A,B,C" "A,C" "B" "A" "D" "A,D" "W" "D" 
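Because strsplit(x, "") breaks each string into single characters, the post-processing above assumes that every sub-level name is exactly one character long. One possible alternative, sketched here rather than taken from the answer, is to drop the empty entries and insert the commas inside the tapply() call itself, which folds Steps 2.3 and 3 into one. The helper name `join_present` is hypothetical.

```r
# Rebuild the example and the objects from Steps 1 to 2.2.
set.seed(1)
mat <- matrix(round(runif(25, 0, 1)), nrow = 5, ncol = 5)
colnames(mat) <- c("a1::C", "a1::A", "a1::B", "b1::D", "b1::A")
mat <- mat[, order(colnames(mat))]

decom       <- strsplit(colnames(mat), "::", fixed = TRUE)
main_levels <- sapply(decom, "[", 1)
sub_index   <- rep(sapply(decom, "[", 2), each = nrow(mat))
sub_index[!as.logical(mat)] <- ""
main_index  <- paste(rep(main_levels, each = nrow(mat)),
                     rep(1:nrow(mat), times = ncol(mat)), sep = "#")

# Hypothetical helper: keep the non-empty labels of one group and join
# them with commas, or fall back to "W" when nothing is present.
join_present <- function(s) {
  s <- s[nzchar(s)]
  if (length(s) > 0) paste(s, collapse = ",") else "W"
}

x <- unname(tapply(sub_index, main_index, join_present))
# "A" "A" "A,B,C" "A,C" "B" "A" "D" "A,D" "W" "D"
```

Since whole labels are joined rather than characters, names such as "foo" or "A2" would pass through unchanged.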

Step 4: matrix

x <- matrix(x, nrow = nrow(mat)) 
colnames(x) <- unique(main_levels) 

#      a1      b1
# [1,] "A"     "A"
# [2,] "A"     "D"
# [3,] "A,B,C" "A,D"
# [4,] "A,C"   "W"
# [5,] "B"     "D"

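Efficiency considerations

The method itself is fairly efficient because it is vectorized, and it requires no manual input of grouping information. For example, you can use the same code when you have hundreds of main groups (before ::) and hundreds of sub groups (after ::). Note that Step 3 splits the collapsed string into single characters, so as written it assumes one-character sub-group names.

The only remaining consideration is to reduce unnecessary memory copies. In this respect we should pass intermediate results inline wherever possible, instead of assigning them explicitly as in the demonstration above. The following works (tested):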

decom <- strsplit(sort(colnames(mat)), "::") 
main_levels <- sapply(decom, "[", 1) 

sub_index <- rep(sapply(decom, "[", 2), each = nrow(mat)) 
sub_index[!as.logical(mat[, order(colnames(mat))])] <- "" 

x <- unname(tapply(sub_index,
                   paste(rep(main_levels, each = nrow(mat)),
                         rep(1:nrow(mat), times = ncol(mat)),
                         sep = "#"),
                   paste0, collapse = ""))
x[x == ""] <- "W" ## no feature present: use "W", as in Step 2.3

x <- matrix(sapply(strsplit(x, ""), paste0, collapse = ","), 
      nrow = nrow(mat)) 

colnames(x) <- unique(main_levels)
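
For comparison, here is a different and less vectorized approach, not taken from either answer: loop over the subjects and build each output column row by row. It is only a sketch. The function name `collapse_features` and its arguments `sep` and `empty` are hypothetical, and it assumes every column name contains the separator exactly once. It makes no assumption about the length of feature names or the number of rows, at the cost of an apply() call per subject.

```r
collapse_features <- function(mat, sep = "::", empty = "W") {
  parts   <- strsplit(colnames(mat), sep, fixed = TRUE)
  subject <- vapply(parts, `[`, character(1), 1)
  feature <- vapply(parts, `[`, character(1), 2)

  # Column indices per subject, subjects kept in order of first appearance.
  groups <- split(seq_len(ncol(mat)),
                  factor(subject, levels = unique(subject)))

  out <- vapply(groups, function(idx) {
    idx     <- idx[order(feature[idx])]         # alphabetical feature order
    present <- mat[, idx, drop = FALSE] != 0    # rows x features, logical
    apply(present, 1, function(on) {
      if (any(on)) paste(feature[idx][on], collapse = ",") else empty
    })
  }, character(nrow(mat)))

  matrix(out, nrow = nrow(mat),
         dimnames = list(rownames(mat), names(groups)))
}

# Usage on the question's example (columns need not be pre-sorted).
set.seed(1)
mat <- matrix(round(runif(25, 0, 1)), nrow = 5, ncol = 5)
colnames(mat) <- c("a1::C", "a1::A", "a1::B", "b1::D", "b1::A")
res <- collapse_features(mat)
```

On the example this should reproduce the matrix the question asks for. Whether it is fast enough for hundreds of groups would need to be measured against the vectorized version.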