2015-12-17 24 views
2

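Fast computation of a co-occurrence matrix from N vectors of labels

I have a matrix in which each of the N rows (iterations of a clustering algorithm) contains the cluster that each of the M points (columns) belongs to.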

For example:

data <- t(rmultinom(50, size = 7, prob = rep(0.1,10))) 

    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] 
[1,] 0 0 0 2 1 1 0 2 1  0 
[2,] 3 1 2 0 0 0 0 1 0  0 
[3,] 0 1 2 1 0 0 0 0 2  1 
[4,] 0 1 1 0 2 0 0 2 0  1 
[5,] 3 0 0 0 2 1 0 0 0  1 
[6,] 0 1 2 0 0 1 1 2 0  0 
[7,] 0 1 0 1 0 1 1 2 1  0 
[8,] 3 0 0 2 0 0 0 1 0  1 
... 

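I want to build a co-occurrence matrix in which position (i,j) holds the total number of times, summed over the different rows, that points i and j have been seen in the same cluster.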
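To make the definition concrete, here is a tiny sketch with made-up labels (the names `labels` and `co` are only for illustration): two clustering runs over three points, where points 1 and 2 share a cluster in both runs and point 3 joins them only in the second.

```r
# Two clustering runs (rows) over three points (columns); labels are made up
labels <- rbind(c(1, 1, 2),
                c(3, 3, 3))

co <- matrix(0, nrow = 3, ncol = 3)
for (n in seq_len(nrow(labels))) {
  # outer(..., "==") is TRUE wherever two points share a label in run n
  co <- co + outer(labels[n, ], labels[n, ], "==")
}
co
#      [,1] [,2] [,3]
# [1,]    2    2    1
# [2,]    2    2    1
# [3,]    1    1    2
```

The diagonal always equals the number of runs, since every point shares a cluster with itself.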

A naive approach is:

coincidences <- matrix(0, nrow=10, ncol=10)
for (n in 1:50){
    for (m in 1:10){
        coincidences[m,] <- coincidences[m,] + as.numeric(data[n,m] == data[n,])
    }
}

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] 
[1,] 50 17 21 22 15 14 16 20 18 18 
[2,] 17 50 17 14 17 18 15 14 20 16 
[3,] 21 17 50 20 21 16 16 13 16 20 
[4,] 22 14 20 50 16 18 16 21 18 14 
[5,] 15 17 21 16 50 18 16 17 11 17 
[6,] 14 18 16 18 18 50 18 22 25 13 
[7,] 16 15 16 16 16 18 50 14 20 22 
[8,] 20 14 13 21 17 22 14 50 11 15 
[9,] 18 20 16 18 11 25 20 11 50 18 
[10,] 18 16 20 14 17 13 22 15 18 50 

How can I make it faster?

Bonus: how can I plot it with ggplot2? (I have seen heatmap.2 from gplots, but I don't know whether that is overkill.)

Answers

1
A faster way is to use vectorization and colSums:

> set.seed(1) 
> data <- t(rmultinom(10000, size = 7, prob = rep(0.1,100))) 
> 
> system.time({ 
+ coincidences <- matrix(0, nrow=100, ncol=100) 
+ for (n in 1:10000){ 
+ for (m in 1:100){ 
+  coincidences[m,] <- coincidences[m,] + as.numeric(data[n,m] == data[n,]) 
+  } 
+ }} 
+) 
    user system elapsed 
    9.692 0.000 9.708 
> 
> system.time(coincidences2<-sapply(1:ncol(data), function(i){ colSums(data[,i]==data) })) 
    user system elapsed 
    0.676 0.096 0.774 
> 
> all.equal(coincidences2,coincidences) 
[1] TRUE 
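A further option, not part of the answer above, is to let matrix multiplication do the counting. As a minimal sketch, assuming the number of distinct labels is small: for each label k, build the 0/1 indicator matrix of `data == k` and add up its cross-products. The names `ks`, `ind` and `coincidences3` are made up for illustration, and no timing is claimed here.

```r
set.seed(1)
data <- t(rmultinom(50, size = 7, prob = rep(0.1, 10)))

# Reference result from the colSums approach
coincidences2 <- sapply(seq_len(ncol(data)), function(i) colSums(data[, i] == data))

# One cross-product per distinct label
ks <- unique(as.vector(data))
coincidences3 <- matrix(0, nrow = ncol(data), ncol = ncol(data))
for (k in ks) {
  ind <- (data == k) + 0            # N x M indicator: 1 where the label is k
  coincidences3 <- coincidences3 + crossprod(ind)  # t(ind) %*% ind
}

all.equal(coincidences3, coincidences2)
```

Entry (i,j) of `crossprod(ind)` counts the rows in which both points carry label k, so summing over all labels counts the rows in which the two points carry the same label.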
4

Rcpp

Implementing it in C++ from R with the Rcpp package will probably get the job done as fast as it is going to get.

library(Rcpp) 

data <- t(rmultinom(50, size = 7, prob = rep(0.1,10))) 
    coincidences <- matrix(0, nrow=10, ncol=10) 

#R implementation
fR<-function(data,coincidences){
    for (n in 1:50){
        for (m in 1:10){
            coincidences[m,] <- coincidences[m,] + as.numeric(data[n,m] == data[n,])
        }
    }
    return(coincidences)
}


#C++ implementation
cppFunction('NumericMatrix fC(NumericMatrix data, NumericMatrix coincidences) {

    int nrow = data.nrow(), ncol = data.ncol();
    // Work on a copy so the matrix passed in from R is not modified in place
    NumericMatrix out = clone(coincidences);

    for (int n = 0; n < nrow; n++) {
        for (int m = 0; m < ncol; m++) {
            // p runs over the columns (points), not the rows
            for (int p = 0; p < ncol; p++) {
                if (data(n, m) == data(n, p)) {
                    out(m, p) += 1;
                }
            }
        }
    }
    return out;
}')

#Call functions 
coincidences <- matrix(0, nrow=10, ncol=10) 
c1<-fC(data,coincidences) 
coincidences <- matrix(0, nrow=10, ncol=10) 
c2<-fR(data,coincidences) 
all.equal(c1,c2) 
> TRUE 


library(microbenchmark) 
microbenchmark(fC(data,coincidences),fR(data,coincidences)) 

> Unit: microseconds 
         expr  min  lq  mean median  uq  max neval 
    fC(data, coincidences) 6.415 6.736 8.88454 7.698 8.660 74.727 100 
    fR(data, coincidences) 283.514 290.089 301.84637 293.456 309.973 388.388 100 

Edit

To plot:

library(reshape2)
library(ggplot2)
C<-fC(data,coincidences)
ggplot(melt(C), aes(Var1,Var2, fill=value)) + geom_raster()
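If you would rather avoid the reshape2 dependency, here is a rough sketch, assuming the co-occurrence matrix was computed with the colSums approach, that builds the long-format data frame in base R. The names `co` and `long` are hypothetical, and the plot is only drawn when ggplot2 is installed.

```r
set.seed(1)
data <- t(rmultinom(50, size = 7, prob = rep(0.1, 10)))
co <- sapply(seq_len(ncol(data)), function(i) colSums(data[, i] == data))

# Long format: one row per (i, j) cell of the matrix
long <- data.frame(
  i = as.vector(row(co)),
  j = as.vector(col(co)),
  count = as.vector(co)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = i, y = j, fill = count)) +
    geom_tile() +      # one tile per pair of points
    coord_fixed()      # keep the cells square
  print(p)
}
```

geom_tile() is used here in place of geom_raster(); for a regular grid like this one the two should look the same.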
+2

There is an error in the 'fR' function: it should be 'for (n in 1:50)' instead of 'for (n in 1:10)'. The fC function also needs to be changed. – fishtank

+0

Thanks @fishtank :) – aeongrail

+0

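I get '"Mean relative difference: 0.4100652"' from the 'all.equal' statement. Shouldn't 'p' iterate over 'ncol' rather than 'nrow'? The other thing is that microbenchmark gives '*** caught segfault *** address 0x3d33018, cause 'memory not mapped''. – fishtank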