2013-02-11 31 views
3
#!/usr/bin/perl 
use strict; 
use warnings; 
use Tie::File; 
use Data::Dumper; 
use Benchmark; 

my $t0 = Benchmark->new; 

# all files in the current folder with $ext will be input. 
# Default $ext is "pileup" 
# if entered, second user entered input will be set to $ext 
my $ext = "pileup"; 
if(exists $ARGV[1]) { 
    $ext = $ARGV[1]; 
} 

# open current directory & store filenames with $ext into @pileupfiles 
opendir (DIR, "."); 
my @pileupfiles = grep {-f && /\.$ext$/} readdir DIR; 

my $dnasegment; 
my $pos; 
my $total; 
my $g_total; 
my @index; #hold current index for each tied file 
my @totalfiles; #hold total files in each sub-index 

# $filenum is iterator to cycle through all pileup files whose names are stored in pileupfiles 
my $filenum = 0; 
# @tied is an array holding all arrays of tied files 
my @tied; 
# array of the current line number for each @file, 
my @linenum; 
# tie each file to an array that is an element of the @tied array 
while($filenum < scalar @pileupfiles) { 
    my @file; 
    tie @file, 'Tie::File', $pileupfiles[$filenum], recsep => "\n" or die; 
    push(@tied, [@file]); 
    # set each line's value of $linenum to 0 
    push(@linenum, 0); 
    $filenum++; 
} 

# open user list of dnasegments 
open(LIST, $ARGV[0]); 
# open file for output 
open(OUT, ">>tempfile.tab"); 

while(<LIST>) { 
    $dnasegment = $_; 
    chomp $dnasegment; 

    my $exit = 0; 
    $pos = 1; 
    my %flag; 

    while(scalar(keys %flag) < scalar @tied) { 
     $total = 0; 
     $filenum = 0; 
     while($filenum < scalar @tied) { 
      if(exists $tied[$filenum][$linenum[$filenum]]) { 
       my @line = split(/\t/, $tied[$filenum][$linenum[$filenum]]); 
       #print $line[0], "\t", $line[1], "\t", $line[3], "\n\n"; 
       if($line[0] eq $dnasegment) { 
        if($line[1] == $pos) { 
         $total += $line[3]; 
         $linenum[$filenum]++; 
         $g_total += $line[3]; 
         print OUT "$dnasegment\t$filenum\t$pos\t$line[3]\n"; 
        } 
       } else { 
        $flag{$filenum} = 1; 
       } 
      } else { 
       #print $flag, "\n"; 
       $flag{$filenum} = 1; 
      } 
      $filenum++; 
     } 
     if($total > 0) { 
      print OUT "$dnasegment\t$total\n"; 
     } 
     $pos++; 
    } 
} 

close (LIST); 
close(OUT); 

my $t1 = Benchmark->new; 
my $td = timediff($t1, $t0); 
print timestr($td), "\n"; 

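Why is my program using Tie::File running so slowly?

The code above takes every file in the current directory that has either the default or the user-supplied file extension, and computes the total number of occurrences (column 4 of the input files) for each entry (column 2 of the input files) where column 1 matches a name listed in the file given on the command line. The layout of the files used by the program is:

File 1: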

Gm02 11896804 G 2 ., \'
Gm02 11896805 G 7 ......, U`
Gm02 11896806 G 3 .,. Sa
Gm02 11896807 T 2 ., U\
Gm02 11896808 T 2 ., ZZ
Gm02 11896809 T 2 ., ZZ
Gm02 11896810 T 2 ., B\
Gm02 11896811 G 3 .,^!, B]E
Gm02 11896812 A 3 T,, BaR
Gm02 11896822 G 3 .,, B`D

File 2:

Gm02 11896804 G 3 .,, \'
Gm02 11896805 G 7 ......, U`
Gm02 11896806 G 3 .,. Sa
Gm02 11896807 T 2 ., U\
Gm02 11896808 T 2 ., ZZ
Gm02 11896809 T 2 ., ZZ
Gm02 11896810 T 2 ., B\
Gm02 11896811 G 3 .,^!, B]E
Gm02 11896812 A 3 T,, BaR
Gm02 11896813 G 3 .,, B`D

File 3:

Gm02 11896804 G 3 .,, \'
Gm02 11896805 G 7 ......, U`
Gm02 11896806 G 3 .,. Sa
Gm02 11896807 T 2 ., U\
Gm02 11896808 T 2 ., ZZ
Gm02 11896809 T 2 ., ZZ
Gm02 11896810 T 2 ., B\
Gm02 11896811 G 3 .,^!, B]E
Gm02 11896812 A 3 T,, BaR
Gm02 11896833 G 3 .,, B`D

In this case, the only command-line argument passed to the program would be a text file with "Gm02" as its contents.

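A hash is used to keep track of what has already been processed. With the example files above, all three files are checked at every position from 1 to 11896803 before the first value to count is encountered at position 11896804. This ensures that each position is checked and summed across all files before the position is incremented.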

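My question is about performance. I decided to use Tie::File because my understanding was that it would improve performance, since the files would not all be read into memory. The real data processed by the program is hundreds of thousands of lines long, across dozens of files. At the moment, running example file 1 alone takes 42 wallclock secs (41.96 usr + 0.00 sys = 41.96 CPU), and running all 3 example files takes 110 wallclock secs (109.76 usr + 0.00 sys = 109.76 CPU). Any information on why this program runs so slowly, or suggestions on how to speed it up, would be greatly appreciated.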

Edit, 10:17 PM EST: The output from the program is as follows:

Gm02 0 11896804 2 
Gm02 1 11896804 3 
Gm02 2 11896804 3 
Gm02 8 
Gm02 0 11896805 7 
Gm02 1 11896805 7 
Gm02 2 11896805 7 
Gm02 21 
Gm02 0 11896806 3 
Gm02 1 11896806 3 
Gm02 2 11896806 3 
Gm02 9 
Gm02 0 11896807 2 
Gm02 1 11896807 2 
Gm02 2 11896807 2 
Gm02 6 
Gm02 0 11896808 2 
Gm02 1 11896808 2 
Gm02 2 11896808 2 
Gm02 6 
Gm02 0 11896809 2 
Gm02 1 11896809 2 
Gm02 2 11896809 2 
Gm02 6 
Gm02 0 11896810 2 
Gm02 1 11896810 2 
Gm02 2 11896810 2 
Gm02 6 
Gm02 0 11896811 3 
Gm02 1 11896811 3 
Gm02 2 11896811 3 
Gm02 9 
Gm02 0 11896812 3 
Gm02 1 11896812 3 
Gm02 2 11896812 3 
Gm02 9 
Gm02 1 11896813 3 
Gm02 3 
Gm02 0 11896822 3 
Gm02 3 
Gm02 2 11896833 3 
Gm02 3 
Gm02 0 11896804 2 
Gm02 1 11896804 3 
Gm02 5 
Gm02 0 11896805 7 
Gm02 1 11896805 7 
Gm02 14 
Gm02 0 11896806 3 
Gm02 1 11896806 3 
Gm02 6 
Gm02 0 11896807 2 
Gm02 1 11896807 2 
Gm02 4 
Gm02 0 11896808 2 
Gm02 1 11896808 2 
Gm02 4 
Gm02 0 11896809 2 
Gm02 1 11896809 2 
Gm02 4 
Gm02 0 11896810 2 
Gm02 1 11896810 2 
Gm02 4 
Gm02 0 11896811 3 
Gm02 1 11896811 3 
Gm02 6 
Gm02 0 11896812 3 
Gm02 1 11896812 3 
Gm02 6 
Gm02 1 11896813 3 
Gm02 3 
Gm02 0 11896822 3 
Gm02 3 
Gm02 0 11896804 2 
Gm02 2 
Gm02 0 11896805 7 
Gm02 7 
Gm02 0 11896806 3 
Gm02 3 
Gm02 0 11896807 2 
Gm02 2 
Gm02 0 11896808 2 
Gm02 2 
Gm02 0 11896809 2 
Gm02 2 
Gm02 0 11896810 2 
Gm02 2 
Gm02 0 11896811 3 
Gm02 3 
Gm02 0 11896812 3 
Gm02 3 
Gm02 0 11896822 3 
Gm02 3 
+1

Off the top of my head, I'd suggest you run it with [Devel::NYTProf](https://metacpan.org/module/Devel::NYTProf) and see what it tells you. – simbabque 2013-02-11 21:02:12

+0

Also, I think the line `chomp $dnasegment` is scary. ;-) – simbabque 2013-02-11 21:04:21

+0

Thanks for letting me know about Devel::NYTProf. I haven't used it before. – azzydood 2013-02-13 18:28:23

Answers

6

I was going to say "because you're using Tie::File", but you aren't, outside of the following lines of code:

my @file; 
tie @file, 'Tie::File', $pileupfiles[$filenum], recsep => "\n" or die; 
push(@tied, [@file]); 

You might as well have written that as

open(my $fh, '<', $pileupfiles[$filenum]) or die $!; 
push(@tied, [ <$fh> ]); 

Perhaps you meant

tie my @file, 'Tie::File', $pileupfiles[$filenum], recsep => "\n" or die; 
push(@tied, \@file); 

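Then we'd be back to my original answer. Tie::File may shorten development time in some situations, but it is far from the fastest solution, and it will probably use more memory than necessary.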
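To make the difference between `[@file]` and `\@file` concrete, here is a minimal sketch. The temporary file and its three records are made up for the demonstration, and the variable names `$copy` and `$ref` are not from the original post. It uses `tied` to check whether each array reference still points at a tied array.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical three-line sample file, created only for this demonstration.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "Gm02\t11896804\tG\t2\n",
            "Gm02\t11896805\tG\t7\n",
            "Gm02\t11896806\tG\t3\n";
close $fh or die "close failed: $!";

tie my @file, 'Tie::File', $path, recsep => "\n"
    or die "Cannot tie $path: $!";

my $copy = [@file];   # fetches every record now and stores a plain, untied array
my $ref  = \@file;    # refers to the tied array itself; records are fetched on access

my $copy_is_tied = tied(@$copy) ? 1 : 0;
my $ref_is_tied  = tied(@$ref)  ? 1 : 0;

print "copy tied: $copy_is_tied, reference tied: $ref_is_tied\n";

untie @file;
```

The anonymous-array copy holds the whole file in memory, so after the `push` in the question's code Tie::File is no longer involved in any of the lookups.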


By the way, `exists` doesn't make sense for array elements.

if (exists $tied[$filenum][$linenum[$filenum]]) 

is better written as

if (defined $tied[$filenum][$linenum[$filenum]]) 

or

if ($linenum[$filenum] < @{ $tied[$filenum] })
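As a small illustration of the two suggested checks, the sketch below uses a plain array with made-up contents (`@lines` and `$i` are hypothetical names). The two tests agree at the end of the array, but differ when an element inside the bounds holds `undef`.

```perl
use strict;
use warnings;

my @lines = ('first', undef, 'third');

# Index 1 is inside the array, but the element there is undef.
my $i = 1;
my $in_bounds_1  = ($i < @lines)        ? 1 : 0;
my $is_defined_1 = defined($lines[$i])  ? 1 : 0;

# Index 3 is one past the last element.
$i = 3;
my $in_bounds_3  = ($i < @lines)        ? 1 : 0;
my $is_defined_3 = defined($lines[$i])  ? 1 : 0;

print "index 1: in bounds=$in_bounds_1 defined=$is_defined_1\n";
print "index 3: in bounds=$in_bounds_3 defined=$is_defined_3\n";
```

For lines read from a file every element inside the bounds is a defined string, so either check detects the end of the data; the bounds comparison states that intent most directly.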
0

I'm not sure what your output should look like. Would it be something like this (given your sample files above)?

$VAR1 = { 
      'Gm02;11896804' => 8, 
      'Gm02;11896805' => 21, 
      'Gm02;11896806' => 9, 
      'Gm02;11896807' => 6, 
      'Gm02;11896808' => 6, 
      'Gm02;11896809' => 6, 
      'Gm02;11896810' => 6, 
      'Gm02;11896811' => 9, 
      'Gm02;11896812' => 9, 
      'Gm02;11896813' => 3, 
      'Gm02;11896822' => 3, 
      'Gm02;11896833' => 3 
     }; 
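One way to arrive at a hash of this shape is to read each file once with an ordinary filehandle and accumulate column 4 under a "segment;position" key. This is only a sketch under assumptions, not necessarily the code this answer had in mind: the helper name `sum_depths`, the tab-separated input, and the two tiny demonstration files are all made up here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not from the original post): read each pileup file once,
# line by line, and add column 4 to a running total keyed by "segment;position".
sub sum_depths {
    my ($paths, $wanted) = @_;
    my %total;
    for my $path (@$paths) {
        open my $in, '<', $path or die "Cannot open $path: $!";
        while (my $line = <$in>) {
            chomp $line;
            my ($segment, $pos, undef, $depth) = split /\t/, $line;
            # Skip short lines and segments that were not requested.
            next unless defined $depth && $wanted->{$segment};
            $total{"$segment;$pos"} += $depth;
        }
        close $in;
    }
    return \%total;
}

# Demonstration with two tiny made-up files (tab-separated, four columns).
my @paths;
for my $rows ([[11896804, 2], [11896822, 3]], [[11896804, 3], [11896813, 3]]) {
    my ($fh, $path) = tempfile(UNLINK => 1);
    print {$fh} join("\t", 'Gm02', $_->[0], 'G', $_->[1]), "\n" for @$rows;
    close $fh or die "close failed: $!";
    push @paths, $path;
}

my $totals = sum_depths(\@paths, { Gm02 => 1 });
print "$_\t$totals->{$_}\n" for sort keys %$totals;
```

Because every line is visited exactly once, the work grows with the number of input lines rather than with the largest position value, and only the running totals are kept in memory.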
+0

That's correct. I posted the full output in the edit above. – azzydood 2013-02-12 14:28:36
