2013-10-27 40 views
-1

我想循環遍歷一個數組,遞增200.我在該200中創建一個子數組,然後將該子數組上的計算移動到下一組200並重復。但是,每當我進行第二次計算時,新的子數組就包含第一個子數組的值。這就像我不能重置或undef子陣列在for循環的下一次迭代中重新使用它。Perl用於循環和作用域變量

完整的代碼可以在GitHub:https://github.com/bsima/yeast-TRX

運行的代碼,你需要以下基因組文件在同一目錄的腳本:https://raw.github.com/bsima/yeast-TRX/master/data/bayanus/genome.csv

下面是相關代碼:

#!/usr/bin/perl 

use strict; 
use warnings; 
use diagnostics; 
use Statistics::Descriptive; 

my $species = "bayanus"; 

open(SPECIES, "<./genome.csv") || die "Cannot open file: $!\n"; 
my @text = <SPECIES>; 

my $geneNameRe = qr/(eY\w{5}[CW])/; 
my $geneRe  = qr/,([atgcATGC-]+)/; 

foreach my $line (@text) { 
    chomp($line); 
    if (defined $line && $line =~ m/e.{4,},[acgtACGT-]+/) { 

     my $gene  = match($geneRe,$line); 
     my $geneName = match($geneNameRe,$line); 

     # Smoothing function 
     # 
     # This moves through the gene data and counts the position until it arrives 
     # at the end of the smoothing window (e.g. 200). Then it calculates the average 
     # of the selected data set and outputs it into the respecive `smooth.csv` file 
     # in the following format, to be read later by R's graphing functions: 
     #  gene,position,trx.mean,energy.mean 
     open(my $smooth, ">>./smooth.csv") || die "Cannot open file $!"; 
     my $smoothingWindow = 200; 
     # Loop through every 200 characters 
     for (my $smoothing = 0; $smoothing < length($gene); $smoothing=$smoothing+$smoothingWindow) { 

      my @trxValues; 
      my @energyScores; 

      # Loop through every character 
      for (my $position = 0; $position <= $smoothingWindow; $position++) { 
       my $dinucleotide = substr($gene,$position,2); 

       if ($dinucleotide =~ m/[actgACTG]{2}/) {  
        my $trxValue = trxScore($dinucleotide); 
        my $energyScore = energyScore($dinucleotide); 

        #print "trxValue = $trxValue\n"; 
        push @trxValues, $trxValue; 
        push @energyScores, $energyScore; 
       }      
      } 

      # Now run the calculations and print to SMOOTH 
      my $trxStat = Statistics::Descriptive::Full->new(); 
      $trxStat->add_data(@trxValues); 
      my $trxMean = $trxStat->mean(); 
      my $energyStat = Statistics::Descriptive::Full->new(); 
      $energyStat->add_data(@energyScores); 
      my $energyMean = $energyStat->mean(); 

      print "$geneName $smoothing: TRX Values @trxValues\n"; 
      print "$geneName $smoothing: TRX Mean is $trxMean.\n"; 

      print $smooth $geneName . "," . $smoothing . "," . $trxMean . "," . $energyMean . "\n"; 

     } 
     close $smooth; 
    } 
} 

# This just makes it easy to do regex matches 
sub match { 
     my ($re, $text) = @_; 
     if ($text =~ $re) { 
       return $1; 
     } 
} 

# @name trxScore 
# @description Calculates and returns the TRX value of a given phosphate linkage 
# @param $dinucleotide {string} The nucleotide to be checked 
# @return {integer} The TRX value 
sub trxScore { 

    my ($dinucleotide) = @_; 

     my %trxScores = (
      qr/(CG)/ => 43, 
      qr/(CA)/ => 42, 
      qr/(TG)/ => 42, 
      qr/(GG)/ => 42, 
      qr/(CC)/ => 42, 
      qr/(GC)/ => 25, 
      qr/(GA)/ => 22, 
      qr/(TC)/ => 22, 
      qr/(TA)/ => 14, 
      qr/(AG)/ => 9, 
      qr/(CT)/ => 9, 
      qr/(AA)/ => 5, 
      qr/(TT)/ => 5, 
      qr/(AC)/ => 4, 
      qr/(GT)/ => 4, 
      qr/(AT)/ => 0, 
      qr/(.-)/ => 0, 
      qr/(-.)/ => 0 # These last two are necessary for dealing with missing values in the genomes 
     ); 

    foreach my $re (keys %trxScores) { 
     if (match($re,$dinucleotide)) { 
      return $trxScores{$re}; 
     } 
    } 
    return 0; 
} 

# @name energyScore 
# @description Calculates and returns the delta-E value of a given phosphate linkage 
# @param $dinucleotide {string} The nucleotide to be checked 
# @return {integer} The delta-E value 
sub energyScore { 

    my ($dinucleotide) = @_; 

     my %energyScores = (
      qr/(AA)/ => -18.5, 
      qr/(AC)/ => -19.0, 
      qr/(AG)/ => -23.6, 
      qr/(AU)/ => -15.7, 
      qr/(CA)/ => -20.0, 
      qr/(CC)/ => -21.4, 
      qr/(CG)/ => -26.9, 
      qr/(CU)/ => -17.2, 
      qr/(GA)/ => -23.7, 
      qr/(GC)/ => -22.9, 
      qr/(GG)/ => -24.3, 
      qr/(GU)/ => -18.9, 
      qr/(UA)/ => -19.6, 
      qr/(UC)/ => -28.2, 
      qr/(UG)/ => -23.3, 
      qr/(UU)/ => -15.8, 
      qr/(.-)/ =>  0, 
      qr/(-.)/ =>  0 # These last two are necessary for dealing with missing values in the genomes 
     ); 

    foreach my $re (keys %energyScores) { 
     if (match($re,$dinucleotide)) { 
      return $energyScores{$re}; 
     } 
    } 
    return 0; 
} 

如果運行大型腳本代碼塊的一部分,將打印的@trxValues內容到終端,你會看到$smoothing變量遞增正確,所以for循環的步驟部分沒有任何問題,但@trxValues的內容不會更改。因此,在第一個for循環結束和第二個循環開始之間的某個位置,數組@trxValues未被正確清除。同樣的事情發生在@energyScores

對此有何意見?我嘗試在for循環的末尾使用undef,並在循環結束時將@trxValues設置爲零,但兩者均無效。我完全被難住了。

編輯:將代碼塊更改爲可自行運行。

+1

你需要顯示最小的工作示例,我不認爲有人想從github運行你的代碼 – Suic

+3

這遠遠超過相關的代碼。你能否把你發佈的內容翻譯成實際相關的內容?這是調試你不明白的問題的第一步,所以你應該在任何情況下這樣做! – ikegami

+0

你發佈的內容甚至不是可運行的,這意味着你必須格外清楚,首先要說明爲什麼你認爲它沒有被清除。你發佈的內容都沒有表明你甚至檢查過。 – ikegami

回答

0

您似乎總是在內循環中使用相同的第一個平滑窗口。

嘗試改變:

for (my $position = 0; $position <= $smoothingWindow; $position++) { 

for (my $position = $smoothing; $position < $smoothing + $smoothingWindow; $position++) { 

雖然內環應該對陣雙方平滑窗口和length($gene)進行檢查(除非你知道它是永遠的偶數倍的200?)

這似乎也很奇怪,你一直在length($gene)一個位置,但你的substr正在提取兩個charact ERS。

+0

其實我認爲這工作。 我一次往上一個位置,但是我的'substr'正在提取兩個字符,因爲我需要分析每一對字符。如果我分析AGTA,我需要分析AG和GT以及TA。如果我將「for」循環增加2個字符,我會錯過GT配對。 – Ben

+0

你的直覺是正確的。在腳本正常工作的時候,我收到一個錯誤,提示'scripttr.pl 206行以外的substr'字符串。因此,無論何時我到達'length($ gene)'末尾,'substr'都會失敗並拋出錯誤。我不知道如何解決這個問題 - 我想我必須做一些數學計算來將平滑調整爲基因長度的倍數。但無論如何謝謝你的幫助。 – Ben

+0

只是讓內循環檢查'$ position <$ smoothing + $ smoothingWindow && $ position ysth