2013-01-24 33 views
1

I have a two-column matrix of unique integer coordinates and scores: how can I efficiently find nearby integers?

> data1<-data.matrix(data.frame("coord"=sample(1:100,50),"scores"=rnorm(25))) 
> data1 
     coord  scores 
[1,] 22 -0.73799827 
[2,] 76 -0.78022310 
[3,] 62 0.45633095 
[4,] 77 0.56617413 
[5,] 60 -0.94876368 
[6,] 83 -1.20792643 
[7,] 85 -1.13890957 
[8,] 78 0.63959763 
[9,] 28 0.28039908 
[10,] 68 -1.04277456 
[11,] 27 0.48755194 
[12,] 66 0.09612861 
[13,] 69 -1.60932063 
[14,]  6 -0.66797103 
[15,] 10 -0.56594989 
[16,] 50 -0.79548555 
[17,] 39 1.13064066 
[18,] 75 0.21617203 
[19,] 34 -0.13480437 
[20,] 54 -1.64825097 
[21,] 48 -0.97955118 
[22,] 58 0.55307028 
[23,] 11 -0.99319227 
[24,] 42 -0.58430293 
[25,] 37 1.76576096 
[26,] 67 -0.73799827 
[27,] 65 -0.78022310 
[28,] 47 0.45633095 
[29,] 72 0.56617413 
[30,] 97 -0.94876368 
[31,] 57 -1.20792643 
[32,] 38 -1.13890957 
[33,] 16 0.63959763 
[34,] 15 0.28039908 
[35,] 86 -1.04277456 
[36,] 33 0.48755194 
[37,] 80 0.09612861 
[38,]  2 -1.60932063 
[39,] 93 -0.66797103 
[40,] 73 -0.56594989 
[41,] 40 -0.79548555 
[42,] 26 1.13064066 
[43,] 13 0.21617203 
[44,] 96 -0.13480437 
[45,] 41 -1.64825097 
[46,] 59 -0.97955118 
[47,] 46 0.55307028 
[48,] 43 -0.99319227 
[49,] 94 -0.58430293 
[50,] 21 1.76576096 

and a vector of unique coordinates:

> centers 
[1] 39 31 61 16 48 82 42 76 71 43 93 35 6 100 67 81 70 79 45 17 96 78 69 95 29 

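I want to build a matrix of the scores in data1 mapped relative to the centers, with each center placed in the middle of the matrix and one center per row. In other words, I want a single matrix showing the scores at the coordinates near each "center". I took the following approach: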

> score_matrix<-matrix(nrow=length(centers),ncol=10) 
> for(i in 1:length(centers)){ 
+ data2 <- data1 
+ data2[,1] <- data2[,1] - centers[i] + ncol(score_matrix)/2 
+ region_scores <- subset(data2,data2[,1] > 0 & data2[,1] <= ncol(score_matrix)) 
+ score_matrix[i,region_scores[,1]]<-region_scores[,2] 
+ } 
> print(score_matrix) 
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10] 
[1,] -0.8688788 0.4524561 1.4594981 -1.0552725 -0.1594024   NA -0.4122056   NA   NA   NA 
[2,] -1.0552725 1.5064965   NA -1.8956159   NA   NA   NA 0.7000265   NA   NA 
[3,]   NA   NA   NA -0.7334736   NA   NA -1.8381591 -1.8381591 -0.7334736   NA 
[4,]   NA   NA -0.3910595 1.5064965   NA -0.1006090 0.1064373 0.4524561   NA   NA 
[5,]   NA   NA 0.8967748   NA   NA   NA   NA 0.8458699 -0.1006090   NA 
[6,]   NA   NA -1.8381591 -1.8381591 -0.7334736   NA   NA   NA   NA   NA 
[7,] -1.3803871 -1.5606603   NA 0.8967748 -0.7036330   NA   NA   NA   NA -1.6780760 
[8,]   NA   NA   NA   NA 0.8458699 -0.1006090   NA -1.5606603   NA   NA 
[9,]   NA   NA   NA -1.3673480 1.8448811 1.1304699   NA -0.8317189 0.1064373 -1.4426410 
[10,] 0.8967748 -0.7036330   NA   NA   NA   NA -1.6780760 -0.3910595   NA   NA 
[11,] 1.1304699   NA   NA -1.0552725 1.5064965   NA -1.8956159   NA   NA   NA 
[12,]   NA   NA -1.6780760 -0.7036330   NA   NA 0.8967748   NA   NA   NA 
[13,]   NA   NA   NA -1.6780760 -0.7036330   NA   NA 0.8967748   NA   NA 
[14,]   NA 0.8458699 -0.1006090   NA -1.5606603   NA   NA 0.8458699 -0.1594024   NA 
[15,] -0.1006090   NA -1.5606603   NA   NA 0.8458699 -0.1594024   NA   NA -0.3910595 
[16,] 1.8448811 1.1304699   NA -0.8317189 0.1064373 -1.4426410   NA 1.8448811   NA -1.4426410 
[17,]   NA   NA -1.0552725 1.5064965   NA -1.8956159   NA   NA   NA 0.7000265 
[18,]   NA   NA   NA   NA   NA -1.3673480 1.8448811 1.1304699   NA -0.8317189 
[19,]   NA   NA   NA   NA -0.7334736   NA   NA -1.8381591 -1.8381591 -0.7334736 
[20,]   NA   NA 0.7000265   NA   NA   NA -0.8688788 0.4524561 1.4594981 -1.0552725 
[21,]   NA -1.3803871 -1.5606603   NA 0.8967748 -0.7036330   NA   NA   NA   NA 
[22,] -0.7334736   NA   NA   NA   NA   NA -1.3673480 1.8448811 1.1304699   NA 
[23,]   NA   NA   NA   NA -1.3673480 1.8448811 1.1304699   NA -0.8317189 0.1064373 
[24,]   NA 1.4594981 0.7000265   NA -1.3673480 -0.8688788 1.1304699   NA   NA -1.0552725 
[25,] -0.8317189 0.1064373 -1.4426410   NA 1.8448811   NA -1.4426410 -1.8956159   NA 1.4594981 
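To make the column mapping easier to follow, here is a tiny deterministic run of the same loop. The names toy, toy_centers and width, and all the values, are made up for illustration and are not part of my real data:

```r
# Hypothetical toy data: four coordinates with easy-to-spot scores
toy <- cbind(coord = c(3, 5, 6, 9), scores = c(0.3, 0.5, 0.6, 0.9))
toy_centers <- c(5, 8)
width <- 4  # assumed even, like the 10 columns used above

toy_matrix <- matrix(NA_real_, nrow = length(toy_centers), ncol = width)
for (i in seq_along(toy_centers)) {
  # shift coordinates so that the center itself lands in column width/2
  pos <- toy[, "coord"] - toy_centers[i] + width / 2
  keep <- pos > 0 & pos <= width
  toy_matrix[i, pos[keep]] <- toy[keep, "scores"]
}
print(toy_matrix)
```

With width 4, the columns of row i hold the scores at coordinates center - 1, center, center + 1 and center + 2. So the row for center 5 is NA, 0.5, 0.6, NA, and the row for center 8 is NA, NA, 0.9, NA.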

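However, the dataset I am applying this to is very large, and the script takes about 24 hours to finish. Is there a more efficient way to accomplish the same thing?

Thanks,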

+0

You could try implementing your algorithm in Rcpp. I wrote a similar algorithm in Rcpp, and the Rcpp version was 22.8 times faster than the R one. – wush978

+0

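I'm having BIG problems understanding the logic here. Did you create score_matrix from a different data1? The numbers in data1 do not appear in it. So how can we test other versions? – Spacedman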

+0

My centers vector can have up to about 10,000 elements. My data1 can have up to about 2,000,000 rows. – dlv

Answers

2

I implemented your function in Rcpp:

data1 <-data.matrix(data.frame("coord"=sample(1:100,50),"scores"=rnorm(25))) 
centers <- unique(data1[,1]) 
score_matrix<-matrix(nrow=length(centers),ncol=10) 
for(i in 1:length(centers)){ 
data2 <- data1 
data2[,1] <- data2[,1] - centers[i] + ncol(score_matrix)/2 
region_scores <- subset(data2,data2[,1] > 0 & data2[,1] <= ncol(score_matrix)) 
score_matrix[i,region_scores[,1]]<-region_scores[,2] 
} 

library(inline) 
library(Rcpp) 

src <- ' 
    NumericMatrix data1(Rdata1); 
    NumericVector centers(Rcenters); 
    NumericMatrix score_matrix(Rscore_matrix); 
    NumericVector data2(data1.nrow()); 
    for(int i = 0;i < centers.size();i++) { 
    data2 = data1.column(0); 
    data2 = data2 - centers(i) + score_matrix.ncol()/2; 
    for(int j = 0, k = 0;j < data2.size();j++) { // subset part 
     if (data2(j) <= 0) 
     continue; 
     if (data2(j) > score_matrix.ncol()) 
     continue; 
     score_matrix(i, data2(j) - 1) = data1(j,1); 
    } 
    } 
    return score_matrix; 
' 

f <- cxxfunction(sig=c(Rdata1 = "numeric", Rcenters = "numeric", Rscore_matrix = "numeric"), 
    plugin="Rcpp", body=src) 

score_matrix2<-matrix(nrow=length(centers),ncol=10) 
score_matrix2 <- f(data1, centers, score_matrix2) 
all.equal(score_matrix, score_matrix2) 

library(rbenchmark) 

benchmark({ 
    score_matrix<-matrix(nrow=length(centers),ncol=10) 
    for(i in 1:length(centers)){ 
    data2 <- data1 
    data2[,1] <- data2[,1] - centers[i] + ncol(score_matrix)/2 
    region_scores <- subset(data2,data2[,1] > 0 & data2[,1] <= ncol(score_matrix)) 
    score_matrix[i,region_scores[,1]]<-region_scores[,2] 
    } 
}, { 
    score_matrix2<-matrix(nrow=length(centers),ncol=10) 
    score_matrix2 <- f(data1, centers, score_matrix2) 
}) 

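In the last test on my machine, the Rcpp version was about 12 times faster than the R version.

If that is still not fast enough, you may need to parallelize the algorithm.

Try the R package snow, and redesign the algorithm so that it partitions the data and merges the results.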
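A minimal sketch of that partition-and-merge idea follows. It uses the base parallel package (which provides snow-style makeCluster and parLapply) rather than snow itself. The helper score_chunk, the choice of two workers, and the random test data are assumptions made for illustration, not part of the original code:

```r
library(parallel)

# Hypothetical helper: fill the score-matrix rows for one chunk of centers.
score_chunk <- function(chunk, coord, scores, width = 10) {
  out <- matrix(NA_real_, nrow = length(chunk), ncol = width)
  for (i in seq_along(chunk)) {
    pos <- coord - chunk[i] + width / 2   # same shift as in the original loop
    keep <- pos > 0 & pos <= width
    out[i, pos[keep]] <- scores[keep]
  }
  out
}

# Made-up data with the same shape as in the question
set.seed(1)
coord <- sample(100, 50)
scores <- rnorm(50)
centers <- sample(100, 25)

n_workers <- 2  # assumption; choose according to available cores and memory
# Partition the centers into contiguous chunks, one per worker
chunks <- lapply(splitIndices(length(centers), n_workers),
                 function(i) centers[i])

cl <- makeCluster(n_workers)
parts <- parLapply(cl, chunks, score_chunk, coord = coord, scores = scores)
stopCluster(cl)

# Merge: chunks are contiguous and in order, so rbind restores the row order
score_matrix_par <- do.call(rbind, parts)
```

Only the centers are split here; coord and scores are sent to every worker, so each process holds its own copy of the full data.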

+0

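Thanks. On a 12-core machine this was about 7 times faster for me. I have a 32-core machine at my disposal. Do you think trying to parallelize would improve on this? – dlv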

+0

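If you have enough memory to hold the objects, your algorithm can be parallelized without much effort. Note that R's parallel packages usually spawn a number of child processes to do the computation, and each process needs to keep its own copy of the data in memory. If memory runs out, the computation will become very slow. – wush978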

1

A re-implementation in plain R is about 47 times faster. Here is my implementation of the original code:

f0 <- function(data1, centers) { 
    score_matrix <- matrix(nrow=length(centers), ncol=10) 
    for(i in seq_along(centers)) { 
     data2 <- data1 
     data2[,1] <- data2[,1] - centers[i] + ncol(score_matrix)/2 
     idx <- data2[,1] > 0 & data2[,1] <= ncol(score_matrix) 
     region_scores <- data2[idx,] 
     score_matrix[i,region_scores[,1]] <- region_scores[,2] 
    } 
    score_matrix 
} 

I hoisted the common computations out of the loop (the centers[i] - ncol(score_matrix)/2 term, and the subsetting of the columns of data1) to get

f1 <- function(data1, centers, ncol=10) { 
    score_matrix <- matrix(NA_real_, length(centers), ncol) 
    ccenters <- centers - ncol/2 
    d1 <- data1[,1] 
    d2 <- data1[,2] 
    for (i in seq_along(ccenters)) { 
     score <- d1 - ccenters[i] 
     idx <- score > 0 & score <= ncol 
     score_matrix[i, score[idx]] <- d2[idx] 
    } 
    score_matrix 
} 

The for loop should benefit from byte compilation:

library(compiler) 
f1c <- cmpfun(f1) 

library(rbenchmark) 
data1 <- data.frame(coord=sample(100,50), scores=rnorm(25)) 
centers <- sort(scan(textConnection("39 31 61 16 48 82 42 76 71 43 93 35 6 
            100 67 81 70 79 45 17 96 78 69 95 29"))) 

> identical(f0(data1, centers), f1(data1, centers)) 
[1] TRUE 
> identical(f0(data1, centers), f1c(data1, centers)) 
[1] TRUE 
> benchmark(f0(data1, centers), f1(data1, centers), f1c(data1, centers), 
+   replications=10, columns=c("test", "elapsed", "relative")) 
       test elapsed relative 
1 f0(data1, centers) 0.139 46.333 
3 f1c(data1, centers) 0.003 1.000 
2 f1(data1, centers) 0.005 1.667 
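Going one step further than f1, the loop over centers can be removed altogether. The sketch below is a different technique from the functions above: it builds the matrix of wanted coordinates with outer() and looks them up with match(). The function name score_window and the test data are hypothetical. It assumes the coordinates are unique (as stated in the question) and that the width is even:

```r
# Sketch: vectorized lookup, no explicit loop over centers.
# Assumes unique coordinates and an even width.
score_window <- function(coord, scores, centers, width = 10) {
  # column j of the result holds the score at coordinate center + j - width/2
  offsets <- seq_len(width) - width / 2
  wanted <- outer(centers, offsets, "+")   # length(centers) x width coordinates
  hit <- match(wanted, coord)              # NA where no such coordinate exists
  matrix(scores[hit], nrow = length(centers), ncol = width)
}

# Made-up data with the same shape as in the question
set.seed(42)
coord <- sample(100, 50)
scores <- rnorm(50)
centers <- sample(100, 25)

res <- score_window(coord, scores, centers)
```

match() needs only length(centers) * width lookups against the coordinate vector, so the work no longer scales with the product of the number of centers and the number of rows of data1. Timings on the real data would still need to be measured.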

The goal in the original question seems a bit incomplete: what do you plan to do with the 10000 x 10 matrix?
