2014-03-07 83 views
0

我有兩個文件:.bedGraph和.bed。 .bedGraph包含cooordinates +強度值(chr,開始,結束,強度),而.bed文件只有座標(chr,start,end)。平均列內容基於另一個文件的範圍

牀銼是通過將距離最遠1000bp的座標拉在一起而製成的。這減少了從牀圖上約66萬的讀數到約30萬。

所以,我bedGraph這個樣子

chr1 10037 10038 0.413963 
chr1 10393 10428 0.827926 
chr1 10540 10546 0.413963 
chr1 10610 10615 0.413963 
chr1 11281 11282 0.413963 

和我的牀看起來像這樣

chr1 10037 56175 
chr1 57265 58983 
chr1 60022 64415 
chr1 65485 74471 
chr1 76305 177390 
chr1 227433 267689 
chr1 317665 384576 
chr1 386108 417753 
chr1 420243 423692 
chr1 425613 426755 

所以我想現在要做的是對具有牀上圖形中添加一列讀取的平均強度(取自.bedGraph文件)落在該區域內,即

.bedGraph 
chr 1 10 1.23413 | 
chr 11 18 0.234  | this <<---------- 
chr 20 24 4.231  |     | 
chr 57 100 2.123413 |     | 
chr 101 123 2.333      | 
              | 
      I want to add this    | 
       |       | 
       |       | 
       V       | 
.bed          | 
chr 1 100 (average of ------------------ 
chr 110 400 (same as above for another interval) 

所以。 ..我寫了一個腳本到目前爲止,我的想法是,我得到的.bed文件的座標,然後存儲bedGraph文件中對應於該間隔內的數據的所有強度值,然後打印出原始牀+平均強度值......易迄今爲止... 這裏是我的代碼:

#! /usr/bin/perl 
use strict; 
use warnings; 
use List::Util qw(sum); 

############################ 
## call with 
## perl average_intensities.pl IN1.bed IN2.bedGraph > OUT.bedGraph 
############################ 

my ($file1, $file2) = @ARGV; 

if (not defined $file1) { 
    die "Need name INPUT 1 file (bed)\n"; 
} 

if (not defined $file2) { 
    die "Need name INPUT 2 file (bedGraph)\n"; 
} 

#declare stuff for first file 
my @coords1; 
my $chr1; 
my $start1; 
my $end1; 

my @coords2; 
my $chr2; 
my $start2; 
my $end2; 
my $int; 
my @intensity; 
my $av_int; 

print "about to open files\n"; ## <<-- this doesn't even print :(

open (IN1, '<', $file1) or die "Could not open $file1: $! \n"; 
open (IN2, '<' ,$file2) or die "Could not open $file2: $! \n"; 

#parse first file and get teh first coordinates 
while(<IN1>){ 
    chomp $_; 

    @coords1 = split "\t", $_; 
    $chr1 = $coords1[0]; 
    $start1 = $coords1[1]; 
    $end1 = $coords1[2]; 

    #parse second file and get the coordinates + intensities 
    while(<IN2>){ 
     chomp $_; 
     @coords2 = split "\t", $_; 
     $chr2 = $coords2[0]; 
     $start2 = $coords2[1]; 
     $end2 = $coords2[2]; 
     $int = $coords2[3]; 
     if ($chr1 eq $chr2){ 

      # if the coordinates on bedGraph are still < than those on bed save the average intensity 
      if($start1 <= $end2){ 
       push @intensity, $int; 
      } else { 
       if (scalar @intensity >0){ 
        $av_int = sum(@intensity)/(scalar @intensity); 
        print join ("\t", $chr1, $start1, $start2, $av_int),"\n"; 
        @intensity =(); 
        last; 
       } 
      } 
     } else { 
      next; 
     } 
    } 
} 
close(IN1); 
close(IN2); 
然而

,當我嘗試運行它,它告訴我

Use of uninitialized value $start2 in numeric le (<=) at average_intensities.pl line 49, <IN2> line 1. 
Use of uninitialized value $start1 in numeric le (<=) at average_intensities.pl line 49, <IN2> line 1. 

(...和它去在文件中的所有行),我不明白爲什麼因爲我確實聲明瞭這兩個變量。 我不確定在這一點上導致它的代碼有什麼問題... 任何建議都會很棒! 謝謝:)

########################################## #

更新的代碼BELOW 我糾正代碼由虛己的建議,我稍微修改他的劇本:

open IN1, "$file1" or die "Could not open file: $! \n"; 
open IN2, "$file2" or die "Could not open file: $! \n"; 

my %bedGraphHoA; 


while (<IN1>) { 
    my @cols = split; 
    push @{ $bedGraphHoA{ $cols[0] } }, [ @cols[ 1 .. 3 ] ]; 
} 

close IN1; 

while (<IN2>) { 
    my (@bedGaphLines, @bedGaphVals); 
    my @cols = split; 
    if (exists $bedGraphHoA{ $cols[0] }) { 

     for my $elements (@{ $bedGraphHoA{ $cols[0] } }) { 

      if ($elements->[0] >= $cols[1] and $elements->[1] <= $cols[2]) { 
       push @bedGaphLines, $elements; 
       push @bedGaphVals, $elements->[2]; 
      } 
     } 
     if (scalar @bedGaphVals > 0){ 
      my $mean = (sum @bedGaphVals)/@bedGaphVals; 
      print join("\t", $cols[0],$cols[1], $cols[2], $mean), "\n"; 
     } 

    } 
} 

close IN2; 

我測試了它的真實數據的一個子集,看起來像它的工作原理

回答

1

您有:

@coords1 = split $line1, "\t"; 

當你的意思是:

@coords1 = split "\t", $line1; 

和後來的一樣,在這裏您有:

@coords2 = split $line2, "\t"; 

你的意思是:

@coords2 = split "\t", $line2; 

兩個$start1$start2取它們的值從split s的結果,@coords1@coords2

或許下面將爲你的努力提供一些方向:

use strict; 
use warnings; 
use List::Util qw/sum/; 

my %bedGraphHoA; 

open my $bedGraphFH, '<', 'bedGraph.txt' or die $!; 

while (<$bedGraphFH>) { 
    my @cols = split; 
    push @{ $bedGraphHoA{ $cols[0] } }, [ @cols[ 1 .. 3 ] ]; 
} 

close $bedGraphFH; 

open my $bedFH, '<', 'bed.txt' or die $!; 

while (<$bedFH>) { 
    my (@bedGaphLines, @bedGaphVals); 
    my @cols = split; 
    if (exists $bedGraphHoA{ $cols[0] }) { 
     for my $elements (@{ $bedGraphHoA{ $cols[0] } }) { 
      if ($elements->[0] >= $cols[1] and $elements->[1] <= $cols[2]) { 
       push @bedGaphLines, $elements; 
       push @bedGaphVals, $elements->[2]; 
      } 
     } 

    } 
    my $mean = (sum @bedGaphVals)/@bedGaphVals; 
     print join("\t", $cols[0], @{ $bedGaphLines[$_] }, $mean), "\n" 
      for 0 .. $#bedGaphLines; 
} 

close $bedFH; 

__END__ 

bedGraph.txt: 
chr 1 10 1.23413 
chr 11 18 0.234 
chr 20 24 4.231 
chr 57 100 2.123413 
chr 101 123 2.333 
chr 120 123 7.555 
chr 150 200 1.275 

bed.txt: 
chr 1 100 
chr 110 400 

Output: 
chr 1 10 1.23413 1.95563575 
chr 11 18 0.234 1.95563575 
chr 20 24 4.231 1.95563575 
chr 57 100 2.123413 1.95563575 
chr 120 123 7.555 4.415 
chr 150 200 1.275 4.415 
+0

奧赫......我應該已經看到了.. !!我糾正了代碼(見上文),但現在我遇到了我沒有輸出的問題: - /!我希望看到我的.bed文件的行數相同,但我看不到任何想法?我也會在主帖中寫下來!謝謝:) – TriRook

+0

@Seb - 我提供了一些可能有用的代碼。 – Kenosis

+0

是的,我確實改變了那部分代碼並編輯了我的問題(我使用了$ _而不是$ line),但是它仍然不能成功打印...... – TriRook

相關問題