2013-05-14 86 views
1

我是perl的新手。仍在學習。根據位置提取fasta序列

我有一個fasta格式的文件。我想提取跨越特定位置的序列。例如,從位置200至300

>Contig[0001] 
TGCATCAAAAGCTGAAAATATGTAGTCGAGAAGTCATTTCGAGAAATTGACGTTTTAAGT 
TTCGGTTTCCAAATTCAACCGGATGTATCTTCGCCAATAATTGTCAGCAGTTAGAATTTC 
TTTCAACATTATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATG 
TAGTCTTGAAGTCATTTCGAGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGG 
ATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT 
TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA 
AATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGT 
CAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTACATATTTTGACCCTGCATC 
AAAAGCTGAAAATATGTAGTCTCGAAGTCATTTTGAGAAGTTAGAATTTCTTTCAACATT 
ATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAA 
GTCWTTTCRAGAAATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTC 
GCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTATATATTT 
TGACTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAAGTCATTTCGAGAAATTGACGTT 

我想從序列Contig[0001]提取200-300位置的序列。輸出將是:

>Contig[0001]_200-300 
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATT 
GTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT 

我在FASTA文件近500序列和我有包含的ID開始結束製表符分隔的文件所需要的志願服務崗位。

如果有人能幫助我,這將是一件好事。

非常感謝您的幫助。我不確定我可以提供包含職位信息的文件。

新手

+0

歡迎SO。這是一個關於編程的問答網站。請看看[faq#howtoask]。你應該總是提供一些代碼給你的問題,並告訴我們你已經嘗試了什麼,或者你做了什麼努力。我已經回答了這個問題,因爲我覺得它很有趣。 – simbabque

+0

另請參閱類似問題** [這裏](http://stackoverflow.com/questions/16520781/select-bases-between-100-200-and-printing-them-along-with-header)** –

回答

2

我假設你知道如何將它加入到你的程序中。看看substr函數。它做你想做的事。

substr EXPR,OFFSET,LENGTH 

所以基本上你需要做的是:

my $snippet = substr($sequence, 200, 100); 

退一步,看看CPAN:有一個名爲Bio::SeqReader::Fasta,您可以使用模塊讀取文件並獲取序列。這在我看來是很好的記錄,我對此很感興趣。所以這裏有一個解決方案。

use strict; 
use warnings; 
use feature qw(say); 
use Bio::SeqReader::Fasta; 

# Put your positions here (start, end) 
my @positions = (
    [ 200, 300 ], 
    [ 50, 180 ], 
); 

open my $fh, '<', '/path/to/file.fasta' or die $!; 
my $in = new Bio::SeqReader::Fasta(fh => $fh); # create the B::SR::F object 

my $i = 0; 
while (my $so = $in->next()) { 
    # $so represents one dataset and is a Bio::SeqReader::FastaRecord 

    say substr($so->seq(),    # we want a part of the sequence string 
    $positions[$i]->[0],    # starting position 
    $positions[$i]->[1] - $positions[$i]->[0]); # end - start --> length 
} 

close $fh; 
3

單程。的script.pl內容:

#!/usr/bin/env perl 

use warnings; 
use strict; 

my ($adn, $l, $header); 

while (<>) { 
    chomp; 

    ## First line is known, a header, so print it and process next one. 
    if ($. == 1) { 
     printf qq|%s_%s\n|, $_, q|200-300|; 
     next; 
    } 

    ## Concat adn while not found a header. 
    if ('>' ne substr $_, 0, 1) { 
     if (! $l) { $l = length } 
     $adn .= $_; 
     if (! eof) { next } 
    } 
    else { 
     $header = sprintf qq|%s_%s\n|, $_, q|200-300|; 
    } 

    ## Extract range 200-300 and insert newlines to set same length of 
    ## line than before. 
    my $s = substr $adn, 199, 100; 
    $s =~ s/(.{$l})/$1\n/g; 
    printf qq|%s\n|, $s; 
    undef $adn; 

    ## If not end of file, print the header of the following adn. 
    if (! eof) { print $header } 
} 

運行它想:

perl script.pl infile 

能產生:使用Perl和生物::用法類似於SeqIO模塊

>Contig[0001]_200-300 
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAAT 
TGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTT 
2

的一種方式。運行像:的./process_fasta.pl

./process_fasta.pl file.fa 200 300 

內容:

#!/usr/bin/perl 

use strict; 
use warnings; 
use Bio::SeqIO; 

my $in_file = $ARGV[0]; 
my $start_pos = $ARGV[1]; 
my $end_pos = $ARGV[2]; 

my $in = Bio::SeqIO->new (-file => $in_file, -format => 'fasta'); 
my $out = Bio::SeqIO->new(-file => ">$in_file.out", -format => 'fasta'); 


while (my $seq = $in->next_seq()) { 

    $seq->display_id($seq->display_id() . "_$start_pos-$end_pos"); 
    $out->write_seq($seq->trunc($start_pos, $end_pos)); 
} 

結果:

>Contig[0001]_200-300 
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAAT 
TGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT 
1

我讚賞那些誰使用Bio::模塊,我喜歡他們在寫新的東西。不過,也許下面會有所幫助:

use strict; 
use warnings; 

my $end = pop; 
my $start = pop; 
local $/ = '>'; 

while (<>) { 
    chomp; 
    next unless /(Contig.+)/; 
    my ($header) = "$/$1_$start-$end\n"; 
    my $seq = ${^POSTMATCH}; 
    $seq =~ s/\s//g; 
    print $header; 
    print +(substr $seq, $start - 1, $end - $start) . "\n"; 
} 

用法:perl script.pl inFile start end [>outFile]

最後,可選參數輸出定向到一個文件中。

實施例:perl script.pl data.fasta 200 300

輸出上的數據集:

>Contig[0001]_200-300 
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTT 

的開始和結束參數pop PED關閉@ARGV,然後將記錄分隔符被設置爲 「>」。由於文件的讀取 - 一次fasta記錄 - 頭部信息被捕獲,將序列保留在${^POSTMATCH}。從序列中刪除所有空白區域。最後,重新格式化的標題爲printed,正如序列中的字符範圍一樣。

希望這會有所幫助!

1

完整的工作輕量級腳本:

#!/usr/bin/env perl 

use strict; 
use warnings; 

my $first=1; 
if ($ARGV[0] eq '-0') 
{ 
    shift @ARGV; 
    $first=0; 
} 

die("Usage: cat <coords> | get_subseqs.pl (-0) <fasta files> > out; where coords is id, start, end. Use -0 if coords start with 0 instead of 1.\n") unless @ARGV; 

# read coords 
my %contigs =(); # id => [ start, end ] 
while (<STDIN>) { 
    chomp; 
    my @row = split(/\t/); 
    die("Require tab-delim: id, start, end\n") unless @row == 3; 
    $contigs{$row[0]} = [ $row[1], $row[2] ]; 
} 

# get subseqs 
my ($id,$seq,$start,$end); 
foreach my $infile (@ARGV) { 
    open(IN, '<', $infile) or die($!); 
    while (<IN>) { 
     if (/>(\S+)/) { 
      my $id2 = $1; 
      print ">${id}_$start-$end\n", reformat(substr($seq, $start-$first, $end-$start+1)) if $id; 
      if (exists($contigs{$id2})) { 
       ($id,$seq,$start,$end) = ($id2,'',@{$contigs{$id2}}); 
       delete($contigs{$id2}); 
      } else { $id=undef } 
     } elsif($id) { $seq .= $_ } 
    } 
    close(IN); 
    print ">${id}_$start-$end\n", reformat(substr($seq, $start-$first, $end-$start+1)) if $id; 
    $id = undef; 
} 

sub reformat { # add newline every 60 bases 
    my $seq = shift; 
    my $seq2 = ''; 
    while (length($seq) > 60) { 
     $seq2 .= substr($seq,0,60)."\n"; 
     $seq = substr($seq,60); 
    } 
    $seq2 .= $seq."\n"; 
    return $seq2; 
}