2014-03-07 85 views
0

我有一個文件表徵的基因組區域,看起來像這樣:提取重疊區域

chrom chromStart chromEnd PGB 
chr1 12874 28371 2 
chr1 15765 21765 1 
chr1 15795 28371 2 
chr1 18759 24759 1 
chr1 28370 34961 1 
chr3 233278 240325 1 
chr3 239279 440831 2 
chr3 356365 362365 1 

基本上PGB,其特徵爲它的染色體數目(CHROM)的基因組區域的類別,啓動(chromStart)和結束( chromEnd)座標。

我希望以摺疊重疊區域,使得重疊PGB的區域= 1和2是在一個新的類別,PGB = 3輸出端:

chrom chromStart chromEnd PGB 
chr1 12874 15764 2 
chr1 15765 24759 3 
chr1 24760 28369 2 
chr1 28370 28371 3 
chr1 28372 34961 1 
chr3 233278 239278 1 
chr3 239279 240325 3 
chr3 240326 356364 2 
chr3 356365 440831 3 

基本上我希望獲得一個輸出文件,其報告獨特的地區。有兩個標準。

首先,如果PGB(第4列)在行之間相同,則合併範圍。例如。

chrom chromStart chromEnd PGB 
chr1 1 10 1 
chr1 5 15 1 

輸出

chrom chromStart chromEnd PGB 
chr1 1 15 1 

其次,如果PGB是行之間不同,CHR(列1)是相同的,並且範圍重疊(COL2和3)中,報告重疊範圍爲PGB = 3作爲以及各個類別獨有的範圍。

例如。

chrom chromStart chromEnd PGB 
chr1 30 100 1 
chr1 50 150 2 

輸出

chrom chromStart chromEnd PGB 
chr1 30 49 1 
chr1 50 100 3 
chr1 101 150 2 

我希望能說明問題更好。

+0

到目前爲止你有嘗試過什麼嗎? – chilemagic

+0

我對perl/unix非常陌生,所以我在excel上手動執行。不幸的是,我有60000多行,所以我希望能有更快的選擇。 – user3222627

+0

@ user3222627你需要多解釋一下你如何得到你想要的結果。 –

回答

1

我已經創建了一個腳本,我相信可以實現此目標。

use List::Util qw(max); 

use strict; 
use warnings; 

# Skip Header row 
<DATA>; 

# With only 60k rows, going to just slurp all the data 
my %chroms; 
while (<DATA>) { 
    chomp; 
    my ($chrom, $start, $end, $pgb) = split ' '; 

    # Basic Data Validation. 
    warn "chromStart is NaN, '$start'" if $start =~ /\D/; 
    warn "chromEnd is NaN, '$end'" if $end =~ /\D/; 
    warn "range not ordered, '$start' to '$end'" if $start > $end; 
    warn "PGB only can be 1 or 2, '$pgb'" if ($pgb ne '1' && $pgb ne '2'); 

    push @{$chroms{$chrom}{$pgb}}, [$start, $end]; 
} 

print "chrom chromStart chromEnd PGB\n"; 

# Process each Chrom 
for my $chrom (sort keys %chroms) { 
    for my $pgb (sort keys %{$chroms{$chrom}}) { 
     my $ranges = $chroms{$chrom}{$pgb}; 

     # Sort Ranges 
     @$ranges = sort {$a->[0] <=> $b->[0] || $a->[1] <=> $b->[1]} @$ranges; 

     # Combine overlapping and continguous ranges. 
     # - Note because we're dealing with integer ranges, 1-4 & 5-8 are contiguous 
     for my $i (0..$#$ranges-1) { 
      if ($ranges->[$i][1] >= $ranges->[$i+1][0] - 1) { 
       $ranges->[$i+1][0] = $ranges->[$i][0]; 
       $ranges->[$i+1][1] = max($ranges->[$i][1], $ranges->[$i+1][1]); 
       $ranges->[$i] = undef; 
      } 
     } 
     @$ranges = grep {$_} @$ranges; 
    } 

    # Create pgb=3 for overlaps. 
    # - Save old ranges into aliases, and then start fresh 
    my $pgb1array = $chroms{$chrom}{1}; 
    my $pgb2array = $chroms{$chrom}{2}; 
    my @ranges; 

    # Always working on the first range in each array, until one of the arrays is empty 
    while (@$pgb1array && @$pgb2array) { 
     # Aliases to first element 
     my $pgb1 = $pgb1array->[0]; 
     my $pgb2 = $pgb2array->[0]; 

     # PGB1 < PGB2 
     if ($pgb1->[1] < $pgb2->[0]) { 
      push @ranges, [@{shift @$pgb1array}, 1] 

     # PGB2 < PGB1 
     } elsif ($pgb2->[1] < $pgb1->[0]) { 
      push @ranges, [@{shift @$pgb2array}, 2] 

     # There's overlap for all rest 
     } else { 
      # PGB1start < PGB2start 
      if ($pgb1->[0] < $pgb2->[0]) { 
       push @ranges, [$pgb1->[0], $pgb2->[0]-1, 1]; 
       $pgb1->[0] = $pgb2->[0]; 

      # PGB2start < PGB1start 
      } elsif ($pgb2->[0] < $pgb1->[0]) { 
       push @ranges, [$pgb2->[0], $pgb1->[0]-1, 2]; 
       $pgb2->[0] = $pgb1->[0]; 
      } 
      # (Starts are equal now) 

      # PGB1end < PGB2end 
      if ($pgb1->[1] < $pgb2->[1]) { 
       $pgb2->[0] = $pgb1->[1] + 1; 
       push @ranges, [@{shift @$pgb1array}, 3]; 

      # PGB2end < PGB1end 
      } elsif ($pgb2->[1] < $pgb1->[1]) { 
       $pgb1->[0] = $pgb2->[1] + 1; 
       push @ranges, [@{shift @$pgb2array}, 3]; 

      # PGB2end = PGB1end 
      } else { 
       push @ranges, [@$pgb1, 3]; 
       shift @$pgb1array; 
       shift @$pgb2array; 
      } 
     } 
    } 

    # Append whichever is left over 
    push @ranges, map {$_->[2] = 1; $_} @$pgb1array; 
    push @ranges, map {$_->[2] = 2; $_} @$pgb2array; 

    printf "%-8s %-11s %-11s %s\n", $chrom, @$_ for (@ranges); 
} 

1; 

__DATA__ 
chrom chromStart chromEnd PGB 
chr1 12871 12873 2 
chr1 12874 28371 2 
chr1 15765 21765 1 
chr1 15795 28371 2 
chr1 18759 24759 1 
chr1 28370 34961 1 
chr3 233278 240325 1 
chr3 239279 440831 2 
chr3 356365 362365 1 

而結果:

chrom chromStart chromEnd PGB 
chr1  12871  15764  2 
chr1  15765  24759  3 
chr1  24760  28369  2 
chr1  28370  28371  3 
chr1  28372  34961  1 
chr3  233278  239278  1 
chr3  239279  240325  3 
chr3  240326  356364  2 
chr3  356365  362365  3 
chr3  362366  440831  2 

注:我創造了這個主要是爲鍛鍊自己,但如果你以任何方式使用它,請讓我知道。

+0

爲連續範圍添加了小修正。應該加入1-4和5-8,因爲我們只處理整數範圍。 – Miller