2016-06-22 20 views
-1

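My code loops over multiple files in a directory, parses each file, and appends each file's parsed content to FinalVariantfile.txt, but the output is duplicated on every loop iteration.

The code works, but it duplicates the content of each file.

When I run the code with two files, the output contains four files' worth of content. Can someone explain why this happens and how to fix it?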

#!/usr/bin/perl -w 

    use strict; 

    #directory structure 

    my $home   = "/data/"; 
    my $tsvdirectory = $home . "test_all_runs/" . $ARGV[0]; 
    my $tsvfiles  = $home . "test_all_runs/" . $ARGV[0] . "/tsv_files.txt"; 

    my $FinalVariants = $home . "test_all_runs/" . $ARGV[0] . "/FinalVariantfile.txt"; 

    my @tsvfiles  =(); 
    my @currentlines =(); 
    my $currentline  = ''; 
    my $currentCNVline = ''; 
    my @currentCNVlines =(); 
    my @HotSpotLines =(); 
    my @CNVLines  =(); 

    # command to produce the vcf_files.txt file stored in each individual run 
    # directory; the file list includes solely vcf files which have not been 
    # previously prepared and/or annotated 
    my $cmd = `ls $tsvdirectory/FOCUS*\.tsv > $tsvfiles`; 

    # print "$cmd"; 
    my $cmda = "ls $tsvdirectory/FOCUS*\.tsv > $tsvfiles"; 

    # print "$cmda"; 

    # this code opens the vcf_files.txt file and passes each line into an array for 
    # indidivudal manipulation 
    open(TXT2, "$tsvfiles"); 
    while (<TXT2>) { 
     push(@tsvfiles, $_); 
    } 
    close(TXT2); 

    foreach (@tsvfiles) { 
     chop($_); 
    } 

    # this code then parses each of the files listed by name in the tsvfiles array 
    foreach (@tsvfiles) { 

     my $currenttsvfile = "$_"; # establishes the current file being manipulated 

     my $MDLfinaltsvfile = $currenttsvfile; 
     $MDLfinaltsvfile =~ s/\.tsv/_prepared\.txt/g; 

     # this series of variable calls names the various intermediate or 
     # final output files 

     my $MDLlinestsvfile = $currenttsvfile; 
     $MDLlinestsvfile =~ s/\.tsv/_withCNV\.txt/g; 

     my $Variantlinestsvfile = $currenttsvfile; 
     $Variantlinestsvfile =~ s/\.tsv/_HotSpot\.txt/g; 

     my $MDLtsvfile = $currenttsvfile; 
     $MDLtsvfile =~ s/\.tsv/_FilteredAllcolumns\.txt/g; 

     my $MDLsampleid = $currenttsvfile; 
     $MDLsampleid =~ s/\-oncogene.tsv//g; 
     print "The currentVCFis############# " . $currenttsvfile . "\n"; 

     my @SampleID =(); 
     @SampleID = split /\//, $MDLsampleid; 
     print "The sampleIDis##############" . $SampleID[4] . "\n"; 

     my $CNVdata = $currenttsvfile; 
     $CNVdata =~ s/\.tsv/_cnv\.txt/g; 

     my $FinalCNVdata = $currenttsvfile; 
     $FinalCNVdata =~ s/\.tsv/_finalcnv\.txt/g; 

     my $cmd2 = `fgrep -v "#" $currenttsvfile > $MDLlinestsvfile`; 
     print "$cmd2"; # this code extracts from the current vcf file all of the 
          # lines of data and outputs them into a separate file 

     # removes lines that contain CNV/intronic/synonymous/utr_3/utr_5
     my $cmd5 = `grep -vwE "(CNV|intronic|synonymous|utr_3|utr_5)" $MDLlinestsvfile > $Variantlinestsvfile`; 
     print "$cmd5"; 

     open(my $fh_in, '<', $Variantlinestsvfile) 
       or die "cannot open $Variantlinestsvfile: $!\n"; 
#removes lines that contain 0/0 and ./. genotypes from field 70. 

     open(my $fh_out, '>', $MDLtsvfile) 
       or die "cannot open $MDLtsvfile: $!\n"; 

     while (my $line = <$fh_in>) { 

      # tab/field-based: 
      my @fields = split(/\s+/, $line); 
      print $fh_out $line unless ($fields[70] =~ m|([0.])/\1|); 
     } 
     close($fh_in); 
     close($fh_out); 

     #open each filtered file with all columns and pushes it into array. 
     open(TXT2, "$MDLtsvfile"); 
     while (<TXT2>) { 
      push(@HotSpotLines, $_); 
     } 
     close(TXT2); 

     foreach (@HotSpotLines) { 
      chop($_); 

      my @HotSpotEntries =(); 
      my $currentMDLline = $_; 
      @HotSpotEntries = split(/\t/, $currentMDLline); 

      my $chr  = $HotSpotEntries[9]; 
      my $position = $HotSpotEntries[10]; 
      my $cosmicids = $HotSpotEntries[21]; 
      my $refforward = $HotSpotEntries[67]; 
      my $genotype = $HotSpotEntries[70]; 
      my $altforward = $HotSpotEntries[77]; 
      my $altreverse = $HotSpotEntries[78]; 
      my $cDNA  = $HotSpotEntries[81]; 
      my $exon  = $HotSpotEntries[83]; 
      my $conseq  = $HotSpotEntries[84]; 
      my $location = $HotSpotEntries[88]; 
      my $geneclass = $HotSpotEntries[92]; 
      my $aachange = $HotSpotEntries[98]; 
      my $transcript = $HotSpotEntries[100]; 

      $currentline 
        = $SampleID[4] . "\t" 
        . $chr . "\t" 
        . $position . "\t" 
        . $cosmicids . "\t" 
        . $refforward . "\t" 
        . $refreverse . "\t" 
        . $genotype . "\t" 
        . $altforward . "\t" 
        . $altreverse . "\t" 
        . $cDNA . "\t" 
        . $exon . "\t" 
        . $conseq . "\t" 
        . $location . "\t" 
        . $geneclass . "\t" 
        . $aachange . "\t" 
        . $transcript; 

      # print "The currentVCFlineis ".$currentline."\n"; 
      push(@currentlines, $currentline); 

     } 

     my $i; 

     for ($i = 0; $i < @currentlines; $i += 1) { 

      my $currentguiline = $currentlines[$i]; 

      my $cmd5 = `echo "$currentguiline" >> $FinalVariants`; 
      print "$cmd5"; 

      #my $cmd9 = `sed -i '1i$SampleID[4]' $FinalVariants`; print $cmd9; 
     } 
    } 
+0

Show the contents of '@tsvfiles' after the 'ls' command. This code seems overly complicated, and it is hard to follow what it appears to be doing. –

+0

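'ls' lists the paths of the tsv files and writes them to tsv_files.txt. Each file is listed only once, so I don't understand why it somehow goes through each file twice. All I do is list the paths of the files of interest, loop over each file, parse it, and dump the result into FinalVariantfile.txt. I am happy to learn another way of accomplishing the same thing, but since I am a novice I try to stay away from complex regular expressions. – user3781528

Answers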

3

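There is no need to start so many new shell subprocesses to do such basic operations. ls, fgrep, grep and echo all have equivalents in Perl, and calling echo for each line of text in particular is a very poor way of copying one file to another.

I suspect that your problem is caused by the line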

my $cmd5 = `echo "$currentguiline" >> $FinalVariants`; 

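This appends every element of @currentlines to the end of the file. So the first time you run the program the file will contain a single copy of the results, but each subsequent run just adds more data to the end of the file, and it keeps growing.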
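The difference between appending and truncating can be sketched in Perl alone, without calling `echo`. This is only an illustration: the helper names `write_lines` and `count_lines` and the temporary file are hypothetical and not part of the posted program. Opening with '>>' behaves like the shell's `>>` and keeps whatever an earlier run left behind, while '>' empties the file first.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write lines to $path using the given open mode
# ('>' truncates the file first, '>>' appends to existing content).
sub write_lines {
    my ($mode, $path, @lines) = @_;
    open my $fh, $mode, $path
        or die qq{Unable to open "$path": $!};
    print $fh "$_\n" for @lines;
    close $fh or die qq{Unable to close "$path": $!};
    return;
}

# Hypothetical helper: count the lines currently stored in $path.
sub count_lines {
    my ($path) = @_;
    open my $fh, '<', $path
        or die qq{Unable to open "$path": $!};
    my @lines = <$fh>;
    close $fh;
    return scalar @lines;
}

# Stand-in for the final output file
my (undef, $path) = tempfile(UNLINK => 1);

write_lines('>>', $path, 'a', 'b');    # first "run"
write_lines('>>', $path, 'a', 'b');    # second "run" appends again
print "append mode:   ", count_lines($path), " lines\n";

write_lines('>', $path, 'a', 'b');     # truncating open starts afresh
print "truncate mode: ", count_lines($path), " lines\n";
```

In this sketch the append case ends up with four lines after two simulated runs, and the truncating case with two.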

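I hate offering hacks to get things working, but it would take me a long time to understand what your program is doing behind all the obfuscation and to write a proper, concise version. You can fix it by adding the line

unlink $FinalVariants or die $!; 

before the foreach (@tsvfiles) { ... } loop. This deletes the file and ensures that a fresh version is created on each execution of the program.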
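One caveat worth sketching: `unlink` returns the number of files it deleted, so `unlink ... or die` also dies when the file does not exist yet, for example on the very first run. A minimal guarded variant, assuming a missing file should simply be ignored, might look like this. The helper name `remove_if_present` and the temporary directory are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: delete $path if it exists.
# Returns 1 if a file was removed, 0 if there was nothing to remove.
sub remove_if_present {
    my ($path) = @_;
    return 0 unless -e $path;
    unlink $path or die qq{Unable to delete "$path": $!};
    return 1;
}

# Temporary directory standing in for the run directory
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/FinalVariantfile.txt";

# Nothing exists yet, so this call is a no-op rather than a fatal error
print "first call removed:  ", remove_if_present($path), "\n";

# Create an empty file, then remove it
open my $fh, '>', $path or die qq{Unable to create "$path": $!};
close $fh;
print "second call removed: ", remove_if_present($path), "\n";
```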



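OK, I have studied your code carefully and I think this will do what you want. Without any data, or even sample file names, I have been unable to test it beyond making sure that it compiles, so it will be a miracle if it works first time, but I believe it is your best chance of getting a coherent solution.

Note that there is a problem in your own code: you use $refreverse but never declare or define it, so the code you show cannot produce the problem you describe, because it dies during compilation with the error message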

Global symbol "$refreverse" requires explicit package name 

I have guessed that it comes right after $ref_forward, at index 68.

Please report back on how well this works.

#!/usr/bin/perl 

use strict; 
use warnings 'all'; 

my $home   = "/data"; 
my $tsv_directory = "$home/test_all_runs/$ARGV[0]"; 

my $final_variants = "$tsv_directory/final_variant_file.txt"; 

open my $out_fh, '>', $final_variants 
     or die qq{Unable to open "$final_variants" for output: $!}; 

my @tsv_files = glob "$tsv_directory/FOCUS*.tsv"; 

for my $tsv_file (@tsv_files) { 

    print "The current VCF is ############# $tsv_file\n"; 

    $tsv_file =~ m|([^/]+)-oncogene\.tsv$| or die "Can't extract Sample ID"; 
    my $sample_id = $1; 
    print "The sample ID is ############## $sample_id\n"; 

    open my $in_fh, '<', $tsv_file 
      or die qq{Unable to open "$tsv_file" for input: $!}; 

    while (<$in_fh>) { 

     next if /^#/; 
     next if /\b(?:CNV|intronic|synonymous|utr_3|utr_5)\b/; 

     my @fields = split; 
     next if $fields[70] eq '0/0' or $fields[70] eq './.'; 

     my @wanted = (9, 10, 21, 67, 68, 70, 77, 78, 81, 83, 84, 88, 92, 98, 100); 
     # Prefix the sample ID, which is the first output column in the question's code
     my $current_line = join "\t", $sample_id, @fields[@wanted]; 

     print $out_fh $current_line, "\n"; 
    } 
} 
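The field selection in the loop above relies on an array slice, and the sample ID comes from a regex capture. As a small self-contained sketch of both pieces, here they are run on a short invented line and an invented file path. The helper names `select_fields` and `sample_id`, the sample line, the indices and the path are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the fields at the given indices from a
# tab-separated line and join them with tabs again.
sub select_fields {
    my ($line, @wanted) = @_;
    chomp $line;
    my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    # An array slice returns the elements at @wanted, in that order;
    # indices beyond the end of the line become empty strings here.
    return join "\t", map { $_ // '' } @fields[@wanted];
}

# Hypothetical helper: extract the sample ID from a path that ends in
# "-oncogene.tsv"; returns undef if the path does not match.
sub sample_id {
    my ($path) = @_;
    return $path =~ m{([^/]+)-oncogene\.tsv$} ? $1 : undef;
}

my $line = join("\t", qw(chr1 12345 A G 0/1)) . "\n";

# Prints the fields chr1, 12345 and 0/1, separated by tabs
print select_fields($line, 0, 1, 4), "\n";

# Prints FOCUS_S1
print sample_id('/data/test_all_runs/run1/FOCUS_S1-oncogene.tsv'), "\n";
```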
+0

Which part is confusing? I delete the $FinalVariants file every time I run the code. – user3781528

+0

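@user3781528: The main problem is that you use such a large number of different variables and temporary files that what is actually going on is obscured. I am sure this is a very simple thing to do. – Borodin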

+0

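I don't think the problem is echo. I tried a different way of appending and it produced the same result. I won't be able to test until tomorrow. Thanks for the suggestion. – user3781528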
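A possible explanation that is independent of how the lines are appended, offered as an assumption from reading the posted code rather than something confirmed in this thread: @HotSpotLines and @currentlines are declared before the foreach loop and never emptied, so the second file is processed together with everything left over from the first. The sketch below models only that accumulation. The sub name `collect` and the two-line "files" are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical model of the loop in the question. Each "file" is an array
# reference of lines. The arrays live outside the per-file loop, as
# @HotSpotLines and @currentlines do in the posted code.
sub collect {
    my ($reset_each_file, @files) = @_;
    my (@hot_spot_lines, @current_lines, @output);
    for my $file (@files) {
        if ($reset_each_file) {
            @hot_spot_lines = ();
            @current_lines  = ();
        }
        push @hot_spot_lines, @$file;           # like push(@HotSpotLines, $_)
        push @current_lines,  @hot_spot_lines;  # one entry per element of @HotSpotLines
        push @output,         @current_lines;   # every entry is appended to the final file
    }
    return @output;
}

my @files = (['f1-a', 'f1-b'], ['f2-a', 'f2-b']);

my @without_reset = collect(0, @files);
my @with_reset    = collect(1, @files);

print "without reset: ", scalar(@without_reset), " lines\n";
print "with reset:    ", scalar(@with_reset), " lines\n";
```

With two files of two lines each, the version without a reset yields eight lines, which is four files' worth, while the version with a reset yields four. That appears consistent with the symptom described in the question. Declaring the arrays with `my` inside the loop would have the same effect as the reset.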