2012-11-18 43 views
0

我希望有人可以快速瀏覽一下這個例子,並幫助我找到更好的方法來解決這個問題。我想運行一個模擬來檢查動物在一系列特定條件下如何在網站之間移動。我有5個站點和一些初步的概率,使用函數替換多個for循環

N<-5 # number of sites 
sites<-LETTERS[seq(from=1,to=N)] 
to.r<-rbind(sites) 

p.move.r<-seq.int(0.05,0.95,by=0.1) # prob of moving to a new site 
p.leave<-0.01*p.move.r # prob of leaving the system w/out returning 
p.move.out<-0.01*p.move.r # prob of moving in/out 
p.stay<-1-(p.move.r+p.leave+p.move.out) # prob of staying in the same site 

在這個例子中,我只包括50個模擬,但在現實中,我想至少有1000模擬,

set.seed(13973) 

reps<-50 # number of replicates/simulations 
steps<-100 # number of time steps (hours, days, weeks, etc) 
random<-runif(10000,0,1) # generating numbers from a random distribution 

# Construct empty df to fill with data 

rep.movements<-matrix(NA,nrow=reps,ncol=steps) 
colnames(rep.movements)<-c(1:steps);rownames(rep.movements)<-c(1:reps) 

rep.use<-matrix(NA,nrow=reps,ncol=N) 
colnames(rep.use)<-c(reefs);rownames(rep.use)<-c(1:reps) 

# Outer loop to run each of the initial parameters 

for(w in 1:length(p.stay)){ 
    p.move<-matrix((p.move.r[w]/(N-1)),N,N) 
    diag(p.move)<-0 

# Construction of distance matrix 
move<-matrix(c(0),nrow=(N+2),ncol=(N+2),dimnames=list(c(sites,"NA","left"),c(sites,"NA","left"))) 
from<-array(0,c((N+2),(N+2)),dimnames=list(c(sites,"NA","left"),c(sites,"NA","left"))) 
to<-array(0,c((N+2),(N+2)),dimnames=list(c(sites,"NA","left"),c(sites,"NA","left"))) 

# Filling movement-Matrix construction 

for(from in 1:N){ 
    for(to in 1:N){ 
     if(from==to){move[from,to]<-p.stay[w]} else {move[from,to]<-p.move[from,to]} 
     move[,(N+1)]<-(1-(p.leave[w]+p.move.out[w]))/N 
     move[,(N+2)]<-(1-(p.leave[w]+p.move.out[w]))/N 
     move[(N+1),]<-p.move.out[w] 
     move[(N+2),]<-p.leave[w] 
} 

}

的想法是使用此累積概率矩陣以基於隨機數的動物的命運,

cumsum.move<-cumsum(data.frame(move)) # Cumulative sum of probabilities 

在此累積矩陣,字母 「A」, 「B」, 「C」, 「d」 和 「E」 代表不同的網站,「NA」表示離開和回到未來時間步的概率,「左」表示離開系統而不回來的概率。然後我使用隨機數列表來比較累積概率矩陣並確定該特定動物的「命運」。

爲(○在1:代表){

result<-matrix(as.character(""),steps) # Vector for storing sites 
x<-sample(random,steps,replace=TRUE) # sample array of random number 
time.step<-data.frame(x) # time steps used in the simulation (i) 
colnames(time.step)<-c("time.step") 
time.step$event<-"" 

j<-sample(1:N,1,replace=T) # first column to be selected 
k<-sample(1:N,1,replace=T) # selection of column for ind. that move in/out 

for(i in 1:steps){ 
    for (t in 1:(N+1)){ 
    if(time.step$time.step[i]<cumsum.move[t,j]){ 
    time.step$event[i]<-to.r[t] 
    break 
    } 
} 

ifelse(time.step$event[i]=="",break,NA) 
result[i]<-time.step$event[i] 
j<-which(to.r==result[i]) 
if(length(j)==0){j<-k} 
} 

result<-time.step$event 

# calculate frequency/use for each replicate 

use<-table(result) 
use.tab<-data.frame(use) 
use.tab1<-use.tab[-which(use.tab==""),] 
mergeuse<-merge(use.tab2,use.tab,all.x=TRUE) 
mergeuse[is.na(mergeuse)]<-0 

# insert data into empty matrix 

rep.movements[o,]<-result 
rep.use[o,]<-mergeuse$Freq 

} 

} 
# for the outer loop I have some matrices to store the results for each parameter, 
    # but for this example this is not important 
rep.movements 
rep.use 

現在,主要問題是,它需要長期運行所有的模擬對於每個初始參數(本例中爲10個值)。我需要找到一種更好/更有效的方法來在所有初始參數中運行1000次模擬/ 20個站點。我不太熟悉功能或其他方法來加速這項任務。任何想法或建議將不勝感激。

非常感謝,

回答

1

讓我們先把你的代碼包裝在一個函數中。我還添加了set.seed命令以使結果重現。在運行模擬之前,您需要刪除它們。

sim1 <- function(reps=50, steps=100) { 

    N<-5 # number of sites 
    sites<-LETTERS[seq(from=1,to=N)] 
    to.r<-rbind(sites) 

    p.move.r<-seq.int(0.05,0.90,by=0.05) # prob of moving to a new site 
    p.leave<-0.01*p.move.r # prob of leaving the system w/out returning 
    p.move.out<-0.01*p.move.r # prob of moving in/out 
    p.stay<-1-(p.move.r+p.leave+p.move.out) # prob of staying in the same site 

    set.seed(42) 
    random<-runif(10000,0,1) # generating numbers from a random distribution 

    cumsum.move <- read.table(text="A   B   C   D   E NA. left 
          A 0.0820000 0.3407822 0.6392209 0.3516242 0.3925942 0.1964 0.1964 
          B 0.1254937 0.4227822 0.6940040 0.3883348 0.4196630 0.3928 0.3928 
          C 0.7959865 0.8730183 0.7760040 0.7930623 0.8765180 0.5892 0.5892 
          D 0.8265574 0.8980259 0.8095507 0.8750623 0.9000000 0.7856 0.7856 
          E 0.9820000 0.9820000 0.9820000 0.9820000 0.9820000 0.9820 0.9820 
          NA. 0.9910000 0.9910000 0.9910000 0.9910000 0.9910000 0.9910 0.9910 
          left 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000 1.0000",header=TRUE) 

    cumsum.move <- as.matrix(cumsum.move) 

    for(o in 1:reps){ 

    result<-matrix(as.character(""),steps) # Vector for storing sites 
    set.seed(42) 
    x<-sample(random,steps,replace=TRUE) # sample array of random number 
    time.step<-data.frame(x) # time steps used in the simulation (i) 
    colnames(time.step)<-c("time.step") 
    time.step$event<-"" 

    set.seed(41) 
    j<-sample(1:N,1,replace=T) # first column to be selected 
    set.seed(40) 
    k<-sample(1:N,1,replace=T) # selection of column for ind. that move in/out 

    for(i in 1:steps){ 
     for (t in 1:(N+1)){ 
     if(time.step$time.step[i]<cumsum.move[t,j]){ 
      time.step$event[i]<-to.r[t] 
      break 
     } 
     } 

     ifelse(time.step$event[i]=="",break,NA) 
     result[i]<-time.step$event[i] 
     j<-which(to.r==result[i]) 
     if(length(j)==0){j<-k} 
    } 

    result<-time.step$event 
    } 
    result 
} 

注意result在每次迭代在Ø期間覆蓋。我不認爲你想要那樣,所以我解決了這個問題。此外,您在循環中使用data.frame。作爲一般規則,你應該避免data.frames像瘟疫一樣的內部循環。儘管它們非常方便,但在效率方面卻很糟糕。

sim2 <- function(reps=50, steps=100) { 

    N<-5 # number of sites 
    sites<-LETTERS[seq(from=1,to=N)] 
    to.r<-rbind(sites) 

    p.move.r<-seq.int(0.05,0.90,by=0.05) # prob of moving to a new site 
    p.leave<-0.01*p.move.r # prob of leaving the system w/out returning 
    p.move.out<-0.01*p.move.r # prob of moving in/out 
    p.stay<-1-(p.move.r+p.leave+p.move.out) # prob of staying in the same site 

    set.seed(42) 
    random<-runif(10000,0,1) # generating numbers from a random distribution 

    cumsum.move <- read.table(text="A   B   C   D   E NA. left 
          A 0.0820000 0.3407822 0.6392209 0.3516242 0.3925942 0.1964 0.1964 
          B 0.1254937 0.4227822 0.6940040 0.3883348 0.4196630 0.3928 0.3928 
          C 0.7959865 0.8730183 0.7760040 0.7930623 0.8765180 0.5892 0.5892 
          D 0.8265574 0.8980259 0.8095507 0.8750623 0.9000000 0.7856 0.7856 
          E 0.9820000 0.9820000 0.9820000 0.9820000 0.9820000 0.9820 0.9820 
          NA. 0.9910000 0.9910000 0.9910000 0.9910000 0.9910000 0.9910 0.9910 
          left 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000 1.0000",header=TRUE) 

    cumsum.move <- as.matrix(cumsum.move) 

    res <- list() 
    for(o in 1:reps){ 

    result<-character(steps) # Vector for storing sites 
    set.seed(42) 
    time.step<-sample(random,steps,replace=TRUE) # sample array of random number 
    #time.step<-data.frame(x) # time steps used in the simulation (i) 
    #colnames(time.step)<-c("time.step") 
    #time.step$event<-"" 
    event <- character(steps) 

    set.seed(41) 
    j<-sample(1:N,1,replace=T) # first column to be selected 
    set.seed(40) 
    k<-sample(1:N,1,replace=T) # selection of column for ind. that move in/out 

    for(i in 1:steps){ 
     for (t in 1:(N+1)){ 
     if(time.step[i]<cumsum.move[t,j]){ 
      event[i]<-to.r[t] 
      break 
     } 
     } 

     ifelse(event[i]=="",break,NA) 
     result[i]<-event[i] 
     j<-which(to.r==result[i]) 
     if(length(j)==0){j<-k} 
    } 

    res[[o]]<-event 
    } 
    do.call("rbind",res) 
} 

這兩個函數是否給出了相同的結果?

res1 <- sim1() 
res2 <- sim2() 
all.equal(res1,res2[1,]) 
[1] TRUE 

新版本更快嗎?

library(microbenchmark) 
microbenchmark(sim1(),sim2()) 

Unit: milliseconds 
    expr  min  lq median  uq  max 
1 sim1() 204.46339 206.58508 208.38035 212.93363 269.41693 
2 sim2() 77.55247 78.39698 79.30539 81.73413 86.84398 

那麼,三個因素已經相當不錯了。由於那些break s,我看不到進一步改進循環的多種可能性。這隻剩下並行化作爲一種​​選擇。

sim3 <- function(ncore=1,reps=50, steps=100) { 
    require(foreach) 
    require(doParallel) 


    N<-5 # number of sites 
    sites<-LETTERS[seq(from=1,to=N)] 
    to.r<-rbind(sites) 

    p.move.r<-seq.int(0.05,0.90,by=0.05) # prob of moving to a new site 
    p.leave<-0.01*p.move.r # prob of leaving the system w/out returning 
    p.move.out<-0.01*p.move.r # prob of moving in/out 
    p.stay<-1-(p.move.r+p.leave+p.move.out) # prob of staying in the same site 

    set.seed(42) 
    random<-runif(10000,0,1) # generating numbers from a random distribution 

    cumsum.move <- read.table(text="A   B   C   D   E NA. left 
          A 0.0820000 0.3407822 0.6392209 0.3516242 0.3925942 0.1964 0.1964 
          B 0.1254937 0.4227822 0.6940040 0.3883348 0.4196630 0.3928 0.3928 
          C 0.7959865 0.8730183 0.7760040 0.7930623 0.8765180 0.5892 0.5892 
          D 0.8265574 0.8980259 0.8095507 0.8750623 0.9000000 0.7856 0.7856 
          E 0.9820000 0.9820000 0.9820000 0.9820000 0.9820000 0.9820 0.9820 
          NA. 0.9910000 0.9910000 0.9910000 0.9910000 0.9910000 0.9910 0.9910 
          left 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000 1.0000",header=TRUE) 

    cumsum.move <- as.matrix(cumsum.move) 

    #res <- list() 
    #for(o in 1:reps){ 
    cl <- makeCluster(ncore) 
    registerDoParallel(cl) 
    res <- foreach(1:reps) %dopar% { 

    result<-character(steps) # Vector for storing sites 
    set.seed(42) 
    time.step<-sample(random,steps,replace=TRUE) # sample array of random number 
    #time.step<-data.frame(x) # time steps used in the simulation (i) 
    #colnames(time.step)<-c("time.step") 
    #time.step$event<-"" 
    event <- character(steps) 

    set.seed(41) 
    j<-sample(1:N,1,replace=T) # first column to be selected 
    set.seed(40) 
    k<-sample(1:N,1,replace=T) # selection of column for ind. that move in/out 

    for(i in 1:steps){ 
     for (t in 1:(N+1)){ 
     if(time.step[i]<cumsum.move[t,j]){ 
      event[i]<-to.r[t] 
      break 
     } 
     } 

     ifelse(event[i]=="",break,NA) 
     result[i]<-event[i] 
     j<-which(to.r==result[i]) 
     if(length(j)==0){j<-k} 
    } 

    #res[[o]]<-event 
    event 
    } 
    stopCluster(cl) 
    do.call("rbind",res) 
} 

相同的結果?

res3 <- sim3() 
all.equal(res1,c(res3[1,])) 
[1] TRUE 

更快? (讓我們用4個核在我的Mac,你可以嘗試以訪問多帶幾個核心的服務器。)

microbenchmark(sim1(),sim2(),sim3(4)) 
Unit: milliseconds 
    expr  min   lq  median   uq  max 
1 sim1() 202.28200 207.64932 210.32582 212.69869 255.2732 
2 sim2() 75.39295 78.95882 80.01607 81.49027 125.0866 
3 sim3(4) 1031.02755 1046.41610 1052.72710 1061.74057 1091.2175 

這看起來可怕。但是,該測試對並行功能不公平。該函數被調用100次,只有50次重複。這意味着我們會得到並行化的所有開銷,但幾乎沒有任何好處。讓我們更公平一點:

microbenchmark(sim1(rep=10000),sim2(rep=10000),sim3(ncore=4,rep=10000),times=1) 
Unit: seconds 
          expr  min  lq median  uq  max 
1   sim1(rep = 10000) 42.16821 42.16821 42.16821 42.16821 42.16821 
2   sim2(rep = 10000) 16.13822 16.13822 16.13822 16.13822 16.13822 
3 sim3(ncore = 4, rep = 10000) 38.18873 38.18873 38.18873 38.18873 38.18873 

更好,但仍然不會令人印象深刻。如果重複次數和步數進一步增加,並行功能看起來不錯,但我不知道你是否需要這個功能。

+0

非常感謝Roland,我實際上對原始代碼做了一些編輯,以包含我正在使用的代碼(它也有一些額外的循環...)的更詳細/可重複的示例。我要先檢查你的腳本。感謝您的輸入! – user1626688

+0

嗨,羅蘭,我不確定這段代碼在包含更詳細的編輯時是否有用......因爲我必須爲每個參數運行模擬。 – user1626688

+0

我相信你的新代碼可以用類似的方式進行優化。主要觀點是使用高效的數據結構(向量或列表)來存儲值。使用'Rprof'函數找出哪個操作花費最多時間,並嘗試優化該代碼。儘可能使用矢量化函數。並看看並行化。 – Roland