2017-07-09 48 views
1

這是一個非常簡單的生物信息學實現的自我對齊矩陣與它使用滑動窗口運算符​​兩次在字符串序列上循環兩次,並比較相同序列的每個fra2。 下面的代碼非常慢,不知道如何使用標準R語法加速它。在python中這將會超快,但是在R中需要1分鐘。通過同時分配i,jj,i,我已經將計算減少了一半。 任何加快思路?R函數超慢

sequence = 'MNLDIHCEQLSDARWTELLPLLQQYEVVRLDDCGLTEEHCKDIGSALRANPSLTELCLRTNELGDAGVHLVLQGLQSPTCKIQKLSLQNCSLTEAGCGVLPSTLRSLPTLRELHLSDNPLGDAGLRLLCEGLLDPQCHLEKLQLEYCRLTAASCEPLASVLRATRALKELTVSNNDIGEAGARVLGQGLADSACQLETLRLENCGLTPANCKDLCGIVASQASLRELDLGSNGLGDAGIAELCPGLLSPASRLKTLWLWECDITASGCRDL' 

if(!exists('BLOSUM50')){ 
    library(Biostrings) 
    data(BLOSUM50) 
    #BLOSUM50['A','N'] 
    } 

    windowSize<-24; 
    matrixSize<-nchar(sequence) - windowSize; 
    defaultValue = -10000000000; 
    scoreMatrix <- matrix(defaultValue, nrow = matrixSize, ncol = matrixSize); 

    for(i in 1:matrixSize){ 
    frag1 = substr(sequence,i,i+windowSize); 

    for(j in 1:matrixSize){ 

     frag2 = substr(sequence,j,j+windowSize); 

     totalScore = 0; 

     if(scoreMatrix[i,j] == defaultValue){ 
     for(x in 1:windowSize){ 
      totalScore = totalScore + BLOSUM50[substr(frag1,x,x),substr(frag2,x,x)]/windowSize; 
     } 

     scoreMatrix[i,j] = totalScore; 
     scoreMatrix[j,i] = totalScore; 
     } 

    } 
    } 
    return(scoreMatrix); 
+0

你先檢查子字符串,然後檢查是否已經填充了該變量。嘗試在'if(scoreMatrix ...){'語句中放置'frag2 = ....'。 – JAD

+0

此外,你做兩次子串,首先到'frag1/2',然後再次在最後的for-loop。你可以把它改爲'substr(sequence,i + x-1,i + x-1)'和'substr(sequence,j + x-1,j + x-1)'並保存一些函數調用。 – JAD

+0

你的'frags'的長度是否爲'windowSize + 1',但後來迭代'x'到'windowSize'是否正確? –

回答

2

你原來的代碼在我不是那麼新的筆記本電腦(聯想瑜伽2從2014年,R3.4)運行17秒。經過不那麼重的優化,這次減少到2秒。在計算開始時,我只是將sequence轉換爲矢量。之後,我改變了BLOSUM50中的名稱索引以通過數字索引進行索引。它導致了0.5秒的執行時間。請參閱下面的代碼和基準。

fun = function(sequence){ 
    windowSize<-24 
    matrixSize<-nchar(sequence) - windowSize 
    defaultValue = -10000000000 
    scoreMatrix <- matrix(defaultValue, nrow = matrixSize, ncol = matrixSize) 

    for(i in 1:matrixSize){ 
     frag1 = substr(sequence,i,i+windowSize) 

     for(j in 1:matrixSize){ 

      frag2 = substr(sequence,j,j+windowSize) 

      totalScore = 0 

      if(scoreMatrix[i,j] == defaultValue){ 
       for(x in 1:windowSize){ 
        totalScore = totalScore + BLOSUM50[substr(frag1,x,x),substr(frag2,x,x)]/windowSize 
       } 

       scoreMatrix[i,j] = totalScore 
       scoreMatrix[j,i] = totalScore 
      } 

     } 
    } 

    scoreMatrix 
} 

fun2 = function(sequence){ 
    windowSize<-24 
    sequence = unlist(strsplit(sequence, split = "")) 
    matrixSize<-length(sequence) - windowSize 
    defaultValue = -10000000000 
    scoreMatrix <- matrix(defaultValue, nrow = matrixSize, ncol = matrixSize) 

    for(i in 1:matrixSize){ 
     frag1 = sequence[i:(i+windowSize)] 

     for(j in 1:matrixSize){ 

      frag2 = sequence[j:(j+windowSize)] 

      totalScore = 0 

      if(scoreMatrix[i,j] == defaultValue){ 
       for(x in 1:windowSize){ 
        totalScore = totalScore + BLOSUM50[frag1[x],frag2[x]]/windowSize 
       } 

       scoreMatrix[i,j] = totalScore 
       scoreMatrix[j,i] = totalScore 
      } 

     } 
    } 

    scoreMatrix 
} 

fun3 = function(sequence){ 
    windowSize = 24 
    sequence = unlist(strsplit(sequence, split = "")) 
    matrixSize = length(sequence) - windowSize 
    scoreMatrix = matrix(NA, nrow = matrixSize, ncol = matrixSize) 
    sequence_index = match(sequence, colnames(BLOSUM50)) 
    for(i in seq_len(matrixSize)){ 
     frag1 = sequence_index[i:(i+windowSize - 1)] 

     for(j in seq_len(matrixSize)){ 

      frag2 = sequence_index[j:(j+windowSize - 1)] 

      if(is.na(scoreMatrix[i,j])){ 
       totalScore = sum(BLOSUM50[(frag2 - 1)*NROW(BLOSUM50) + frag1])/windowSize 
       scoreMatrix[i,j] = totalScore 
       scoreMatrix[j,i] = totalScore 
      } 

     } 
    } 

    scoreMatrix 
} 


if(!exists('BLOSUM50')){ 
    library(Biostrings) 
    data(BLOSUM50) 
    #BLOSUM50['A','N'] 
} 

sequence = 'MNLDIHCEQLSDARWTELLPLLQQYEVVRLDDCGLTEEHCKDIGSALRANPSLTELCLRTNELGDAGVHLVLQGLQSPTCKIQKLSLQNCSLTEAGCGVLPSTLRSLPTLRELHLSDNPLGDAGLRLLCEGLLDPQCHLEKLQLEYCRLTAASCEPLASVLRATRALKELTVSNNDIGEAGARVLGQGLADSACQLETLRLENCGLTPANCKDLCGIVASQASLRELDLGSNGLGDAGIAELCPGLLSPASRLKTLWLWECDITASGCRDL' 
library(microbenchmark) 
microbenchmark(
    original = fun(sequence), 
    fun2 = fun2(sequence), 
    fun3 = fun3(sequence), 
    times = 5 
) 

# Unit: milliseconds 
# expr   min   lq  mean  median   uq  max neval 
# original 16395.2108 16660.3295 17533.8563 16755.9680 17594.3596 20263.4137  5 
# fun2  1992.7731 2010.4031 2027.7953 2015.9592 2034.9022 2084.9390  5 
# fun3  472.0641 481.9267 496.2656 498.3259 506.6357 522.3755  5 
+0

太棒了!感謝您的建議。現在肯定更快。學會了一些「R」技巧。然而,矩陣對於結果'object.size(scoreMatrix)'爲488kb是巨大的。我可以指定矩陣類型來'float'以某種方式縮小它嗎?我真正的序列有時會返回18MB的矩陣......謝謝,EL –

+0

@ElDude據我所知R不支持單精度('float')數字。但是,您可以從'sum(...)/ windowSize'中移除'/ windowSize',並且函數的結果將是大小爲244kb的整數矩陣。當需要時,這個整數矩陣可以被「windowSize」分割。 –