2013-02-06 99 views
1

I have several files like the one below, and I am trying to do numeric profiling using the algorithm shown in the image.

numeric profiling methodology

>File Sample 
attttttttttttttacgatgccgggggatgcggggaaatttccctctctctctcttcttctcgcgcgcg 
aaaaaaaaaaaaaaagcgcggcggcgcggasasasasasasaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 

I want to take each substring of size 2, map it to 33 values for different properties, and then sum those values over a window of size 5.

my %temp = (
    aCount => {
        aa => 2,
    },
    cCount => {
        aa => 0,
    },
);

My current implementation, including the window size, is as follows:

while (<FILE>) {
    my $line = $_;
    chomp $line;

    while ($line =~ /(.{2})/og) {
        $subStr = $1;
        if (exists $temp{aCount}{$subStr}) {

            push @{$temp{aCount_array}}, $temp{aCount}{$subStr};

            if (scalar(@{$temp{aCount_array}}) == $WINDOW_SIZE) {

                my $sum = eval (join('+', @{$temp{aCount_array}}));
                shift @{$temp{aCount_array}};
                # Similar approach has been taken for the other 33 rules
            }

        }

        if (exists $temp{cCount}{$subStr}) {
            # similar approach
        }

        $line =~ s/.{1}//og;
    }
}

Is there any other approach that would improve the whole process?

+0

What does the data represent? – amphibient

+0

The data represents scientific values for certain substring properties, e.g. for xray, aa has the value 0.091 –

+1

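I don't believe your problem is speed, because the code you show cannot work at all. It is unclear how you want to manipulate the data, but it looks like you have confused 'aCount' and 'cCount'. Your code uses only 'aCount', and you are using '$temp{aCount}' as both a hash reference and an array reference. – Borodin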

Answers

0

Regular expressions are great, but they can be overkill when all you need is fixed-width substrings. The alternatives are substr

my $len = length($line);
for (my $i = 0; $i < $len; $i += 2) {
    my $subStr = substr($line, $i, 2);
    ...
}

and unpack

foreach my $subStr (unpack "(A2)*", $line) {
    ...
}
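To show how either loop could feed the windowed sum from the question, here is a minimal sketch. It is not the asker's code: the window_sums name and the at => 1 entry are made up for illustration (only aa => 2 and the window size of 5 come from the question), it reads non-overlapping pairs as the loops above do, and it keeps a running total instead of calling eval on a joined string.

```perl
use strict;
use warnings;

# Hypothetical lookup table: only aa => 2 appears in the question,
# the at => 1 entry is invented for this example.
my %a_count     = ( aa => 2, at => 1 );
my $WINDOW_SIZE = 5;    # window size stated in the question

# Return the sum of every full window of pair values found along one line.
sub window_sums {
    my ($line, $table, $size) = @_;
    my (@window, @sums);
    my $running = 0;

    for my $pair (unpack '(A2)*', $line) {
        # Pairs without a value are skipped, as in the question's code.
        next unless exists $table->{$pair};

        push @window, $table->{$pair};
        $running += $table->{$pair};

        if (@window == $size) {
            push @sums, $running;
            $running -= shift @window;    # drop the oldest value
        }
    }
    return @sums;
}

my @sums = window_sums('aaaaaaaaaaaaat', \%a_count, $WINDOW_SIZE);
print "@sums\n";    # prints "10 10 9"
```

Calling the same sub once per property table would cover the other rules without repeating the loop body. With fractional values such as 0.091, a running total can pick up small rounding errors over a long line, so re-summing the window each time may be preferable if exact agreement matters.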

I don't know how much faster either one would be than the regular expression, but I know how I would find out.
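One way to measure it is the core Benchmark module. The sketch below is only a starting point: the by_regex, by_substr and by_unpack helper names are made up here, and the test string is the first sample line from the question. Note that the regex version silently drops a trailing single character on odd-length input, while substr and unpack return it.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# First sample line from the question.
my $line = 'attttttttttttttacgatgccgggggatgcggggaaatttccctctctctctcttcttctcgcgcgcg';

# Each helper returns a reference to the list of 2-character chunks.
sub by_regex {
    my @pairs = $_[0] =~ /(.{2})/g;
    return \@pairs;
}

sub by_substr {
    my ($s) = @_;
    my $len = length $s;
    my @pairs;
    for (my $i = 0; $i < $len; $i += 2) {
        push @pairs, substr($s, $i, 2);
    }
    return \@pairs;
}

sub by_unpack {
    return [ unpack '(A2)*', $_[0] ];
}

# A negative count tells Benchmark to run each candidate for at least
# that many CPU seconds, then print a comparison table.
cmpthese(-1, {
    regex  => sub { by_regex($line) },
    substr => sub { by_substr($line) },
    unpack => sub { by_unpack($line) },
});
```

The numbers depend on the Perl version and the line length, so it is worth running this against real input rather than trusting a single result.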