2012-06-26
0

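Perl: searching for a pattern in array elements

I am a Perl novice and have run into another bioinformatics problem on which I need some help and input.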

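The problem in brief:

  1. I have a file containing more than 40,000 unique DNA sequences. By unique, I mean that the sequence IDs are unique. I have attached part of the file at the end of this post to show what it looks like.

  2. I need to split each sequence into 3 equal parts. So if a particular sequence is 999 characters long, each of the 3 parts has 333 characters.

  3. I need to search each of the 3 parts for the following pattern:

    $gpat = '[G]{3,5}'; $npat = '[A-Z]{1,25}';
    $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;

  4. If $pattern occurs in the 1st of the 3 parts, increment the 'beginning' counter; if it occurs in the 2nd part, increment the 'middle' counter; and if it occurs in the 3rd part, increment the 'end' counter.

  5. Print the 'beginning', 'middle', and 'end' counters, i.e. the totals of 'beginning', 'middle', and 'end' over all sequences.

    Say the values for the first sequence are '2', '5', '3' and the values for the second sequence are '4', '1', '6'; the final counts should then be '6, 6, 9'.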
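The summing in step 5 can be sketched as below. This is only an illustration: `add_counts` is a hypothetical helper name, and the two count triples are the example values from the list above, not real results.

```perl
use strict;
use warnings;

# Hypothetical helper: add one sequence's (beginning, middle, end)
# counts into a running array of totals, position by position.
sub add_counts {
    my ($totals, $counts) = @_;
    $totals->[$_] += $counts->[$_] for 0 .. 2;
    return $totals;
}

my @totals = (0, 0, 0);
add_counts(\@totals, [2, 5, 3]);    # example counts for the first sequence
add_counts(\@totals, [4, 1, 6]);    # example counts for the second sequence
print "@totals\n";                   # prints "6 6 9"
```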

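The problems I am facing:

  1. When a sequence is split into 3 parts, a potential $pattern match can be lost. For example, take a sequence such as: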

gggatgtcgatgcatggggatgcatcgatgcggggactagctagcgggatgctacgatggggatgatgataatatcgcggcgcatatatgctagtctatatatta

Splitting it into 3 parts yields the following 3 sub-parts, each 35 characters long:

gggatgtcgatgcatggggatgcatcgatgcgggg
actagctagcgggatgctacgatggggatgatgat
aatatcgcggcgcatatatgctagtctatatatta

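So $pattern is split across the first two parts. Is there any way to say "if $pattern starts in the 1st part and ends in the 2nd part", increment the 'beginning' counter?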
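The effect can be reproduced with a small script. It is a sketch only: it assumes the pattern is compiled case-insensitively (the sequences are lower case), and the variable names are chosen for illustration.

```perl
use strict;
use warnings;

my $gpat    = '[G]{3,5}';
my $npat    = '[A-Z]{1,25}';
my $pattern = join $npat, ($gpat) x 4;    # gpat.npat.gpat.npat.gpat.npat.gpat
my $regex   = qr/$pattern/i;              # /i is an assumption: the data is lower case

# The three 35-character parts from the example above.
my @parts = qw(
    gggatgtcgatgcatggggatgcatcgatgcgggg
    actagctagcgggatgctacgatggggatgatgat
    aatatcgcggcgcatatatgctagtctatatatta
);
my $whole = join '', @parts;

# None of the parts contains the motif on its own ...
my @hits_in_parts = grep { $_ =~ $regex } @parts;
print "parts with a match: ", scalar(@hits_in_parts), "\n";

# ... but the unsplit sequence does; @- and @+ hold the start and end offsets.
my ($start, $end);
if ($whole =~ $regex) {
    ($start, $end) = ($-[0], $+[0]);
    print "match in whole sequence: start $start, end $end\n";
}
```

The match starts at offset 0, inside the first part, and ends beyond offset 35, inside the second part, which is why searching the parts separately misses it.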

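## UPDATE ## The following problem has been solved, thanks to the code suggested by Cupidvogel.

2. How do I split a sequence into 3 parts if its length is not divisible by 3? I tried using int, but then the last part comes out 1-2 characters short.

Below is the code I have written so far.

It reads in the file, displays the header name and the sequence, works out the length of each part, and finally splits the sequence into 3 parts. It works fine provided the sequence length is divisible by 3; for sequences whose length is not, the third part is off by 1-2 characters.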
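The splitting step can be sketched in isolation as follows. `split_in_three` is a hypothetical helper name. It uses the same kind of `unpack` template as the script below, with lower-case `a` instead of `A` so that no trailing whitespace is stripped.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into three parts. The first two parts
# get int(length/3) characters each; the third takes whatever is left, so it
# is up to 2 characters longer when the length is not a multiple of 3.
sub split_in_three {
    my ($seq) = @_;
    my $num = int(length($seq) / 3);
    return unpack "a$num a$num a*", $seq;
}

my @parts = split_in_three('gggatgtcga');    # 10 characters
print join('|', @parts), "\n";               # prints "ggg|atg|tcga"
```

No characters are lost this way, but the three parts are not all the same length, so a motif near a boundary can still straddle two parts.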

#Take Filename from user 
print "Please enter file name : "; 
$in =<>; 
chomp $in; 


open (FASTA,"$in") or die ; 
while (<FASTA>)
{
    $/ = ">";
    @array = split '\n', $_;
    $header = shift @array;    # Header of the fasta sequence
    print "\n\nNext sequence: \n";
    print $header, "\n";

    $seq = join '', @array;    # sequence
    $seq =~ s/\s//g;
    $seq =~ s/\*//g;
    $seq =~ s/>//g;
    print $seq, "\n\n";

    $num = int(length($seq) / 3);
    @arr = unpack("A$num A$num A*", $seq);
    print " New method gives this :", @arr;
    print "\nThe first element is :", $arr[0];
    print "\nThe second element is :", $arr[1];
    print "\nThe third element is :", $arr[2];

    # The following lines of code were originally written to split...
    # ...the sequence into 3 parts, albeit unsuccessfully
    #my $split = (length $seq)/3;
    #print $split,"\n\n";

    #my $int = int $split;
    #print $int,"\n\n";

    #my @array2 = $seq =~ /(.{$int})/g;
    #print join (" ", @array2),"\n\n";

    #print $array2[0],"\n",$array2[1],"\n",$array2[2];
}


exit; 

So far I have been testing the code with the following sample file: sample.fa

>ABC_123 2 
atgtcgatcgatcggcgggcatgcgcgcgcggatg 
atatatagcgcgcgctatatagcgcgactctacgc 
atgctgctgactagctatagtcgctgactgcgcgt 
gggaaaaagggcccgggccccgttttggggatcta 
ggggatagctgatgctagcatgcatgctgactgca 
>DEF_456 4 
gggatgtcgatgcatggggatgcatcgatgcgggg 
actagctagcgggatgctacgatggggatgatgat 
aatatcgcggcgcatatatgctagtctatatatta 
>GHI_789 1 
atagctgctagtcgatcggcgcgggtatcgatcgg 
ggatcgatcgatcggggatcgatcgggggatcgat 

The actual input file looks like this:

>NR_037701 1 
aggagctatgaatattaatgaaagtggtcctgatgcatgcatattaaaca 
tgcatcttacatatgacacatgttcaccttggggtggagacttaatattt 
aaatattgcaatcaggccctatacatcaaaaggtctattcaggacatgaa 
ggcactcaagtatgcaatctctgtaaacccgctagaaccagtcatggtcg 
gtgggctccttaccaggagaaaattaccgaaatcactcttgtccaatcaa 
agctgtagttatggctggtggagttcagttagtcagcatctggtggagct 
gcaagtgttttagtattgtttatttagaggccagtgcttatttagctgct 
agagaaaaggaaaacttgtggcagttagaacatagtttattcttttaagt 
gtagggctgcatgacttaacccttgtttggcatggccttaggtcctgttt 
gtaatttggtatcttgttgccacaaagagtgtgtttggtcagtcttatga 
cctctattttgacattaatgctggttggttgtgtctaaaccataaaaggg 
aggggagtataatgaggtgtgtctgacctcttgtcctgtcatggctggga 
actcagtttctaaggtttttctggggtcctctttgccaagagcgtttcta 
ttcagttggtggaggggacttaggattttatttttagtttgcagccaggg 
tcagtacatttcagtcacccccgcccagccctcctgatcctcctgtcatt 
cctcacatcctgtcattgtcagagattttacagatatagagctgaatcat 
ttcctgccatctcttttaacacacaggcctcccagatctttctaacccag 
gacctacttggaaaggcatgctgggtctcttccacagactttaagctctc 
cctacaccagaatttaggtgagtgctttgaggacatgaagctattcctcc 
caccaccagtagccttgggctggcccacgccaactgtggagctggagcgg 
gagggaggagtacagacatggaattttaattctgtaatccagggcttcag 
ttatgtacaacatccatgccatttgatgattccaccactccttttccatc 
tcccagaagcctgctttttaatgcccgcttaatattatcagagccgagcc 
tggaatcaaactgcctctttcaaaacctgccactatatcctggctttgtg 
acctcagccaagttgcttgactattctcagtctcagtttctgcacctgtc 
aaatagggtttatgttaacctaactttcagggctgtcaggattaaatgag 
catgaaccacataaaatgtttggtgtatagtaagtgtacagtaaatactt 
ccattatcagtccctgcaattctatttttcttccttctctacacagcccc 
tgtctggctttaaaatgtcctgccctgctttttatgagtggataccccca 
gccctatgtggattagcaagttaagtaatgacactcagagacagttccat 
ctttgtccataacttgctctgtgatccagtgtgcatcactcaaacagact 
atctcttttctcctacaaaacagacagctgcctctcagataatgttgggg 
gcataggaggaatgggaagcccgctaagagaacagaagtcaaaaacagtt 
gggttctagatgggaggaggtgtgcgtgcacatgtatgtttgtgtttcag 
gtcttggaatctcagcaggtcagtcacattgcagtgtgtcgcttcacctg 
gctccctcttttaaagattttccttccctctttccaactccctgggtcct 
ggatcctccaacagtgtcagggttagatgccttttatgggccacttgcat 
tagtgtcctgatagaggcttaatcactgctcagaaactgccttctgccca 
ctggcaaagggaggcaggggaaatacatgattctaattaatggtccaggc 
agagaggacactcagaatttcaggactgaagagtatacatgtgtgtgatg 
gtaaatgggcaaaaatcatcccttggcttctcatgcataatgcatgggca 
cacagactcaaaccctctctcacacacatacacatatacattgttattcc 
acacacaaggcataatcccagtgtccagtgcacatgcatacacgcacaca 
ttcccttcctaggccactgtattgctttcctagggcatcttcttataaga 
caccagtcgtataaggagcccaccccactcatctgagcttatcaaccaat 
tacattaggaaagactgtatttcctagtaaggtcacattcagtagtactg 
agggttgggacttcaacacagctttttgggggatcataattcaacccatg 
acagccactgagattattatatctccagagaataaatgtgtggagttaaa 
aggaagatacatgtggtacaaggggtggtaaggcaagggtaaaaggggag 
ggaggggattgaactagacacagacacatgagcaggactttggggagtgt 
gttttatatctgtcagatgcctagaacagcacctgaaatatgggactcaa 
tcattttagtccccttctttctataagtgtgtgtgtgcggatatgtgtgc 
tagatgttcttgctgtgttaggaggtgataaacatttgtccatgttatat 
aggtggaaagggtcagactactaaattgtgaagacatcatctgtctgcat 
ttattgagaatgtgaatatgaaacaagctgcaagtattctataaatgttc 
actgttattagatattgtatgtctttgtgtccttttattcatgaattctt 
gcacattatgaagaaagagtccatgtggtcagtgtcttacccggtgtagg 
gtaaatgcacctgatagcaataacttaagcacacctttataatgacccta 
tatggcagatgctcctgaatgtgtgtttcgagctagaaaatccgggagtg 
gccaatcggagattcgtttcttatctataatagacatctgagcccctggc 
ccatcccatgaaacccaggctgtagagaggattgaggccttaagttttgg 
gttaaatgacagttgccaggtgtcgctcattagggaaaggggttaagtga 
aaatgctgtataaactgcatgatgtttgcaggcagttgtggttttcctgc 
ccagcctgccaccaccgggccatgcggatatgttgtccagcccaacacca 
caggaccatttctgtatgtaagacaattctatccagcccgccacctctgg 
actccctcccctgtatgtaagccctcaataaaaccccacgtctcttttgc 
tggcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 
aaa 
>NM_198399 1 
aacagattttaactctgaaaagccatttccagtgtctatagactattgtg 
agcctggagaagtagcatttagttgggatagcttcactagagctgcctgc 
caaagacttccttccacaggatcttgtcgcaccagcaactgacaggagct 
tgggagctcgggagcttgggagagggcttatgtttttaataatgtagctg 
tcagttcgaagcctggaaatgttgaccctcaaagggcataaaatcttgtt 
attttaatttgcatctgggagaatgtctgagcaaggagacctgaatcagg 
caatagcagaggaaggagggactgagcaggagacggccactccagagaac 
ggcattgttaaatcagaaagtctggatgaagaggagaaactggaactgca 
gaggcggctggaggctcagaatcaagaaagaagaaaatccaagtcaggag 
caggaaaaggtaaactgactcgcagccttgctgtctgtgaggaatcttct 
gccagaccaggaggtgaaagtcttcaggatcagactctctgaaaactgca 
aatggaaaggaattcaaaagaatttagattaaaagttaaataaaaagtag 
gcacagtagtgctgaattttcctcaaaggctctcttttgataaggctgaa 
ccaaatataatcccaagtatcctctctccttccttgttggagatgtctta 
cctctcagctccccaaaatgcacttgcctataagaaacacaattgctggt 
tcatatgaaacttaggaaatagtgaataaggtgcatttaactttggagaa 
atacttttatggctttggtggagatttctcaatactgcaaaagttgtcca 
gaaatgaatctgagctgatggtgactttaagttaatattattaatatatc 
actgcatatttttacccttatttttgctccttacagcaagattagtaggt 
tataaaaatttaaatttaaacaaaattatttcatgacaaaatgggaaact 
tcacatcatacttatttttgtttgcctttcaggcatcatattagctttta 
taaaaaatggtcttgctgctgaaattgtacttattttatcagaggctggg 
tgcagtcaagacaaaagtaaaatggtttacctgagcccaggggagggaaa 
attgattaagatatcattatttttgtttggtttggttttgcttttttcct 
cttactttaattgaaatactctgaattcccctcatggaaacagagagcat 
tgagagcactttctttaaaaggaccaaaaataaattcctaatagattttg 
tcctaagagagtgtttttttttctagcatcattttctttacatgccactc 
atgtcataaggcatggacaggctatctttcagtggccattactatgtttc 
gtacacatgctttattttacttgggctctgagaaatgtgtggctttcctt 
cagcattttatttgtgcttctctttttaatggagattgaaaagggagaat 
aatgtgaatatcacggcttatattattaaatgttgattgatggcttgtaa 
tgtactgcacacaatatatgttaactctgcagaatgacagaccctgggag 
aagtaatgccccagttgtcccccactcctaatgccaggcagagaaggaca 
gcctttatagacttaatctgctttttgtcccatttgacaaggtaccagga 
ggaaattttttaagggatcaactgtatcacagtgcccactctggacctaa 
gtctagtgtatccatacaattggtgcagagaaataaggtgtaaatggtgc 
tttgttcctgctggttccaagctcagaaaccaagactagctttgtaggag 
agaatgagagcctgcaagcctctctttggattggctgaggagtggtggga 
gcagggggttgatagaaaacatccagacacacatataagcaagtggccgt 
gctacctttttagagaataaagaaacagacttttgagtttatatgcaatg 
ccttcattaggtaccaccggcacttacaaaatgtgcggactgaatcccag 
agaacactggcagatgtatacagtatatggattgtatcgcttccccaatg 
tttgtaaattcacagtatttggaaaactgccttcattttccagtgtggga 
aaaactcttgctacctgtattacttgatctcagacccatacctgatggtt 
cagtctgtccttaagttaaaagaattttgcttttctaatgttatactatt 
tacctgtcagtgtattactgcaacttgaatcactcttttactgttgttgg 
atataaacttatcctgtaccaatgtatttattaacacttgtattttatta 
ttgagcatatcaataaaaatattaaaaaataacagattgttttttaccaa 
aaaaaaaaaaaaa 
>NR_026816 1 
caacccactctctgtgctatgacttcattactctttcccagcccagccct 
gggcaagccccttacgaagtctcaggctacctggatgaccaccctttctt 
atgatgctgcaaggagggcaggtgggcagagccccgtgcatcctgggctc 
aggccagggacccaagagcttgggagaagctggttctcagactgaaggcc 
agagcccagcaccttgtcaccatcccggggagcatcatggcacacaacaa 
ccagagccaaggctacagctagagagttgactcctctatttgagattgac 
aggcctcggaagtcaaaataagtggtttcctagaccgggtcgagagcaag 
tctctattggtcccaactgagttttttcagctggtttttcaaccaaacag 
cacctcatctcccagtgaggggaagggaaggctgggctgagagcagcaag 
gctgctcatctcacctctccccacccagccatgccagccgcctcacctgg 
tggggagaggtgggcctcacctgggtcccctggcagtgctctgtgaaggg 
tcttgacattgcactgtaataataaaggtgtgtgtgaagtatcaaaaaaa 
>NR_027917 1 
atgaagatgattgagcagcacaatcaggaatacagggaagggaaacacag 
cttcacaatggccatgaacgcctttggagaaatgaccagtgaagaattca 
ggcaggtggtgaatggctttcaaaaccagaagcacaggaaggggaaagtg 
ctccaggaacctctgcttcatgacatccgcaaatctgtggattggagaga 
gaaaggctacgtgactcctgtgaaggatcagtgcagctggggctctgtaa 
ggacagatgttaggaaaactgagaaactagtttcactgagtgtgcagacc 
tggtggactgctctaggcttcaaggcaatgttggctgcatttttggagaa 
ccattattttgcttccagtatgttgccgacaatggaggcctggactctga 
ggaatccttttcatatgaagaaaagctctggagactggaaagtccaaggt 
cacagaggtgcatctggtgagagccttcttgctagtggggaatctcagca 
gagtcctgaggtggcacagtattctgggaagcatcaagtgcagtgtcatc 
ttatcgaggaggctctgcagatgctaagtggtggggatgaggatcacgat 
gaagacaaatggccccatgacatgaggaatcatctggctggagaggccca 
ggtgtag 
>NR_002777 3 
cttgtcctttcagaagatcagagacaagtgatatctgtgccaatttggcc 
ttttcagtgttataattatggtgtcttgggatcccaatatttctcctaat 
gtttccctgatgtgatactttgagagcccaggatgccagtacaataattg 
aaattcacaaatgtctggtatcttgtccctcgtgccccatatattatctg 
tggtttcggagagctcacttgtctcttatcttcagaaatgacagcacatg 
aaatgttgtttggagccactgtcacatcaactgtagaaaaattaacaggt 
cagctaagggatataatgtaactttatttgtgatatgagagaaatcttga 
taaagacttgagagaaaactgggaggaaccttgtttagaagttataagga 
ggggtaagttatgtgtgtcttggaaggagaatcataaatcttaaaacatg 
agcctaatagagaacataaaattctaaaagataaagataataataatgat 
aagccgcagggtggcttatgataatgtgacttctccttaccccagtagcg 
tcggacatctgtcagctctgaaatgataaaaatgcacaatattgaataca 
aacaaaggagtcagcactgaaattcattttctctccagattagggaaaga 
gtaggtatgccctatggtagggcagtaaattgctgaatgatgagatgaaa 
cagccacctagccatttcccattaaatataatcccatcagcagcagacaa 
tatctatcctcccctatcccctctatccatatttggaaactgcaccctct 
tccctatttagcaccctaacaccacttgaattccataaccctgttgttga 
tctagctctcctcacctctaaacacttctagcattcctttcagatcagga 
gctcgaaacactctcctttgattttttggaaaagtttctggcttcttcaa 
ggtcacgttctccgtcctaagaattaaaaaaaaaaaaaaaaacttccaaa 
cctttgaccttgtgtccgtggaacacccctgacttcctatcatttcaacc 
cattgaggcacttgaactctcttcttggggatcctgagaagggagagtgc 
aaactcttgaccctggaggcaaacaaaatgttctcatgtttgccttccca 
cttactttctgtgagaacgtgggaagatcttaacctctcagaagcacagt 
ttcttccttctaaaatgaaataattaacctctccctgtctacattcttaa 
actcataggacataaaaaaaaaaaaaa 
>NR_033769 1 
ggcctctggcgggcctccagccagttagaccatttgactaggacgtgtgc 
agctcagccagccacagaactggaatttttcaggagcagggggagcatgg 
agtttggactttgctgagcaactgaagtggagcgcagagcttgctcgctt 
aggagagggcagcatggatggcaaacaagggggcatggatgggagcaagc 
ccacggggccaagagactctcctgacaccaggcttctttcaaacccattg 
atgggtgattctgtgtctgattggtctcctatgcctgaagctgcaatcta 
cggacatcagctgtctctgaggaacctcatcagccacgggtggcttgtga 
acatcatcatggcagatcatgtttccccactccatgaagcctgtctcaga 
ggtcatccctctcgtgtaaagattttattaaagcatggagctcaggtgaa 
tggcgtgacaacagactggcacactccactgtttaatgtttgtatcagca 
gcagctgggattatgcttctgcagcatggagccagcgttcaacctgagag 
tgatctggcatcccccgtccatgaagctgctaggagaggccacgtggagt 
gtgtcgactctcttacagcttataggggcaaaaatgaccataacatcagc 
cacgtgggcacttcactgtatttggcttgtgaaaaccagcagatagcctg 
tgtcaagaagcttctggagtcaggagcagacctgaacccagggagaggtt 
ccccacttcatgcagtggccttcatgaaggccctcatgaaggattcccca 
cttcatgcagtggccaggacagccagtgaagagctggcctgcctgctcat 
ggattttggagcagacacccaggccaagaatgctgaaggcaaatgtcatg 
tggagctggtgcctccagagagccctttgatccagctcttcttggagaga 
gaagggcccccttcttttgatgcagttatgcctagaaatcagaagggctt 
tggaatccagcagcatcataagataaccaaagtcgtcctcccagaggatc 
tgaaatggtttctcctacatctttgtatgtatcaatggaatggattcaca 
aacaatgtgaaaacattattgagtgttgtagccactagaattttaaaatc 
aagttaggtttatagagtttgactagttttttcgattagatttgtattag 
ttataaatttgttcatagagtttgactaattttttcgattagatttgtat 
ttgttaaactctgaagccagagtttaaacacactgcatacgtttgtatga 
ttagttagaaggcatgaagacttttttccctgcttggagactgtctaaaa 
taacagctattgttttgcatatccactgcaggccaagcactttcagcatc 
atctaattcagccctcacagcaactgggtcaatctgtccaatttcccagg 
gcaaggatagaggagtcagattcaaatacaggttttctgacgttaactta 
tgtgatgatttgatcaaagcaggattttccagcatcactatccttgttcc 
atctctgctatatgggaatgaaaataaagaaatgtatttcaaaaaaataa 
aaagaaaagaaaaacagagacggtc 
>NM_016326 3 
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc 
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg 
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac 
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag 
tgtgaaaggccacgtgaagatgctgcggctggtgtttgcacttgtgacag 
cagtatgctgtcttgccgacggggcccttatttaccggaagcttctgttc 
aatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagt 
tttgtaattttatattactttttagtttgatactaagtattaaacatatt 
tctgtattcttccacatattttctgcagttattttaactcagtataggag 
ctagaggaagagatttccgaagtctgcaccccgcgcagagcactactgta 
acttccaagggagcgctgggagcagcgggatcgggttttccggcacccgg 
gcctgggtggcagggaagaatgtgccgggatccgcctcagggatctttga 
atctctttactgcctggctggccggcagctccg 
>NM_181641 2 
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc 
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg 
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac 
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag 
tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta 
tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact 
ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag 
acttgatcgattaatgaagtggttattttggcctttgcttgtgtttgcac 
ttgtgacagcagtatgctgtcttgccgacggggcccttatttaccggaag 
cttctgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaa 
aaaagaagttttgtaattttatattactttttagtttgatactaagtatt 
aaacatatttctgtattcttccacatattttctgcagttattttaactca 
gtataggagctagaggaagagatttccgaagtctgcaccccgcgcagagc 
actactgtaacttccaagggagcgctgggagcagcgggatcgggttttcc 
ggcacccgggcctgggtggcagggaagaatgtgccgggatccgcctcagg 
gatctttgaatctctttactgcctggctggccggcagctccg 
>NM_001144931 1 
gtttccgttcctctgcccgccatgccgttcctagagctgcacacgaattt 
ccccgccaaccgagtgcccgcggggctggagaaacggctgtgcgccgtcg 
ctgcctccatcttgggcaaacctgcagaccttgtgaacgtgacggtacgg 
ccgggcctggccagggcgctgagcgggtccaccgagccctgcgcgcagct 
gtccatctcctccatcggcgtagtgggcaccgccgaggacaaccgcagcc 
acagtgcccacttctttgagtttctcaccaaggagctagccctgggccag 
gaccggtgcgcaggggtagtaggcccggaatattattctaaaacacaatc 
agagtactccattcctgctaacagtttaaagccaaacacctaggcaggcc 
atttaggcttctgaatgactgggtcttgaccaggagagctgctgtctagg 
ttttctcttcctgaccagttcctcaagagaaatgcaaaactagtgattaa 
cagtaagagtcaggcagggcgcggtggctcacgcctgtaatcccagcact 
ttgggaggccgag 
>NR_029429 1 
ggacaccaccccaaaatttcctagtcctctttgatacgggttcctccaat 
ctgtagctgccctccatctactgccagagccaagtctgctccaatcacaa 
caggttcaatcccagcctgtcctccaccttcagaaacgatggacaaacct 
atggactatcctatgggagtggcagcctgagtgtgttcctgggctatgac 
actgtgactgttcataacatcgttgtcaataaccaggagtttggcctgag 
tgagaatgagcccagcgaccccttttactattcagactttgacgggatcc 
tgggaatggcctacccaaacatggcagaggggaattcccctacagtaatg 
caggggatgctgcagcagagccagcttactcagcccgtcttcagcttcta 
cttcacctgccagccaacccgccagtattgtggagagctcatccttggag 
gtgtggaccccaactttattctggtcagatcatctggacccctgtcagcc 
cgtaactgtactggcagattgccatcgaggaatttgccatcggtaaccag 
gccactggcttgtgctctgagggttgccaggccattgtggataccgagac 
cttcctgc 
>NR_026551 1 
tgtggcctgagaggacggccaggactggccagaaaagagagggacgtggc 
taaacgtgagggggcgtggccaagatggccgcgtgcgggatcctcgggta 
ccgggagcgaacgaggaggttctggctcagtgcatccactctgggagagc 
gtggacctggttcctgggggcgatcgccagtcacccatcaacattcggtg 
gagggacagtgtttatgatcccggcttaaaaccactgaccatctcttatg 
acccagccacctgcctccacgtctggaataatgggtactctttcctcgtg 
gaatttgaagattctacagataaatcagctgcacttagtgcattggaacg 
cagtcaaatttgaaaactttgaggatgcagcactggaagaaaatggtttg 
gctgtgataggagtatttttaaagatttcggaaacttctggcagcccagt 
gtctactggaaggcccaagccgcttgccagaaagctgcgccccgcccaaa 
agcactgggttctgcagtccaggcccttcctcagctcccaggtccaggag 
aactgcaaggtcacctacttccacaggaagcactgggtccgcatccggcc 
cctccgcaccactcctcccagctgggactacacccgcatctgcatccaga 
gagagatggtccccgcccgcatccgcgtcctgagagagatggtccccgag 
gcctggaggtgctttcccaacaggctgccgctgctgagcaacatcaggcc 
tgatttctccaaggctcccctggcctacgtgaagcggtggctttggaccg 
cccgccacccccacagcctgtccgcagcctggtgaccgtgaaaatcgccc 
cgccagagagcagaggaagcccgacgcccaggccatctgccttcaggtct 
gtgatgagaaacggagtggcctgttccgttgtgcccaggtctaggccgct 
gagcagagccctcactcccaggcagagttgtctgaatccttcct 
>NM_181640 2 
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc 
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg 
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac 
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag 
tgtgaaaggccacgtgaagatgctgcggctggatattatcaactcactgg 
taacaacagtattcatgctcatcgtatctgtgttggcactgataccagaa 
accacaacattgacagttggtggaggggtgtttgcacttgtgacagcagt 
atgctgtcttgccgacggggcccttatttaccggaagcttctgttcaatc 
ccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagttttg 
taattttatattactttttagtttgatactaagtattaaacatatttctg 
tattcttccacatattttctgcagttattttaactcagtataggagctag 
aggaagagatttccgaagtctgcaccccgcgcagagcactactgtaactt 
ccaagggagcgctgggagcagcgggatcgggttttccggcacccgggcct 
gggtggcagggaagaatgtgccgggatccgcctcagggatctttgaatct 
ctttactgcctggctggccggcagctccg 
>NM_016951 3 
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc 
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg 
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac 
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag 
tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta 
tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact 
ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag 
acttgatcgattaatgaagtggttattttggcctttgcttgatattatca 
actcactggtaacaacagtattcatgctcatcgtatctgtgttggcactg 
ataccagaaaccacaacattgacagttggtggaggggtgtttgcacttgt 
gacagcagtatgctgtcttgccgacggggcccttatttaccggaagcttc 
tgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaa 
gaagttttgtaattttatattactttttagtttgatactaagtattaaac 
atatttctgtattcttccacatattttctgcagttattttaactcagtat 
aggagctagaggaagagatttccgaagtctgcaccccgcgcagagcacta 
ctgtaacttccaagggagcgctgggagcagcgggatcgggttttccggca 
cccgggcctgggtggcagggaagaatgtgccgggatccgcctcagggatc 
tttgaatctctttactgcctggctggccggcagctccg 
>NR_002773 1 
cagcaccacaccaggaccctccagaggctgtgagaaacatcctgcaccca 
ggtcctctctatctgtttatcattgtctattttgtattctgcattcagaa 
ccaagagcctgaagacgacccaggagctttagctatggctgtcttcatta 
ttttgtccctgtttagtgttctggtgacaggcatgggtgaaggtggggct 
gggagtgagaaaggaggtgagagggaatgtaagctgaaccagcttcccca 
ttgcccctccgtatctcccagtgcccagccttggacacaccctggccaga 
gccagctgtttgcagacctgagccgagaggagctgacggctgtgatgcgc 
tttctgacccagcagctggggccagggctggtggatgcagcccaggccca 
gccctcggacaactgtgtcttctcagtggagttgcagctgcctcccaagg 
ctgcagccctggctcacttggacagggggagccccccacctgcccgggag 
gcactggccatcgtcttctttggcaggcaaccccagcccaacgtgagtga 
gctggtggtggggccactgcctcacccctcctacatgcgggacgtgactg 
tggagcgtcatggaggccccctgccctatcaccgacgccccatgttgttc 
caagagtacctggacatagaccagatgatcttcgacagagagctgcccca 
ggcttctgggcttctccatcactgttgcttctacaagcgccggggacgga 
acctggtgacaatgaccacggctccccgtggtctgcaatcaggggaccgg 
gccacctagtttggcctctactacaacatctcgggcgctgggttcttcct 
gcaccacgtgggcttggagctgctagtgaaccacaaggcccttgaccctg 
cccgctggactatccagaaggtgttctatcaaggccgctactatgacagc 
ctggcccagctggaggcccagtttgaggccggcctggtgaatgtggtgct 
gatcccagacaatggcacaggtgggtcctggtccctgaagtcccctgtgc 
ccccgggtccagctccccctctgcagttccatccccaaggcccccgcttc 
agtgtccagggaagtcgagtggcctcctcactgtggactttctcctttgg 
cctcggagcattcagtggcccaaggatctttgacgttcccttccaagggg 
agagggtggcctatgaagtcagtgtccaggcggccttggccatctatgga 
ggcaattctccttctgctctacgaagccggtacatagatagtggctttgg 
cttgggccacttctccacgcccctgacccatggggtggactgcccctacc 
tggccacctacgtggactggcacttcctttttgagtcccaggccgccaag 
acaatacgcgatgccttttgtatatttgaacagaaccagggcctccccct 
gcggcgacaccactcagatctctactcccactactttgggggccttgcgg 
aaacggtgctggtcatcagatctgtgtctactatgctcaactatgactat 
gtgtgggatatggtcttccaccctaatggggccatagaaatcagactcca 
caccaccggctacatcagctcagcattcccctttggtgctgcccagaggt 
atggaaacaaagtttcagagcacaccctgggcacggtccacacccacagc 
gcccacttcaaggtggacctggatgtagcaggtaaggcatcctggcagag 
gcaaaagtgctggaggggtgagctgaagtctccatgcctagctttaaaag 
ttttcgttgggctgggagcagtagcttatgcctgtaagcccaacactttg 
ggagactgaggggggtggatcacttgaggtcaggagttcaaaaccagcct 
ggccaacatggcgaaatcctgtctgtactaaaaatacaaaaattagctgg 
gcatgggtatgctgtaatcctagctactcgggaggctgaggcaggagaat 
cacttgaatctgggagtcagaggttgcagtgagctgagattgagccactg 
cactccatcctgcgtgactgaac 
>NR_037806 1 
attcccagtcacccactcactcagaaagccgggagtcatcggacaccttg 
ctggtcagaggtcctgggggtggttttgaaccatcagagcttggactttt 
ctgacttccccagcaaggatcttcccacttcctgctccctgtgttcccac 
cctccagtgttggcacaggcccacccctggctccaccagagccagaagca 
gaggtagaatcaggcgggccccgggctgcactccgagcagtgttcctggc 
catctttgctactttcctagagaacccggctgttgccttaaatgtgtgag 
agggacttggccaaggcaaaagctggggagatgccagtgacaacatacag 
ttcatgactaggtttaggaattgggcactgagaaaattctcaatatttca 
gagagtccttcccttatttgggactcttaacacggtatcctcgctagttg 
gttttaagggaaacactctgctcctgggtgtgagcagaggctctggtctt 
gccctgtggtttgactctccttagaaccaccgcccaccagaaacataaag 
gattaaaatcacactaataacccctggatggtcaatctgataataggatc 
agatttacgtctaccctaattcttaacattgcagctttctctccatctgc 
agattattcccagtctcccagtaacacgtttctacccagatcctttttca 
tttccttaagttttgatctccgtcttcctgatgaagcaggcagagctcag 
aggatcttggcatcacccaccaaagttagctgaaagcagggcactcctgg 
ataaagcagcttcactcaactctggggaatgctaccattttttttccaaa 
gtagaaaggaagcacttctgagccagtgaccactgaaagatgaacactct 
tcctgatcctctcctctagaattcatctcctcctgctagcagccgcgtcc 
tggaggagcagcggatggggaatccattctgtttcttcctggtgtttagg 
aagttgccccacacacagattgccccgatgtccaaccagaagaagtgaaa 
ctgctgctgggtctggagaggtgaagacccgtggccagcttctgttgttg 
ccatcggccattgctttttgttcgcttgcttttggttttgcaagaagagc 
ggcctctgtctctgatctgcttcaaatcatcattccatcagtgacagaag 
tggctgttccatcagtggtcgcagccagttcagctcctgcatccatcccc 
aagtgttctgagtggaatttgaggcctccccaaccacctaccaaaaaagg 
agggtgaaatgaaaggaagaagaaaaactcagcattctttcctctgacaa 
agagtaaaacgacaaggaatatcggcctgaattctcttcccaagaagaaa 
gaaagcacaccaacgcaggcatttgtcttctgtccatggtgctgaagttt 
attcactttcaaaccactttcagtaacagcaaattctttagaaaaggaaa 
atacagggaaagggataaacctcactgacttggaggaaatcaagaggagt 
gagcacagcatcagaaagccccctggccccagactgcacccgctttcctg 
gccctaccttgaaatccatcaggtctgcgttggacacggcattgtacatg 
ggattagctctg 

Any help and comments would be deeply appreciated.

Thank you for taking the time to go through my question!

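Are you sure you don't want `my $npat = '[A-FH-Z]{1,25}'`? That is, can the sequences between the 'G's also contain 'G's? The two choices give very different results. – Borodin

@Borodin Hello again, sir! Actually, the sequences between the Gs can also contain 1-2 Gs, so `my $npat` is indeed `'[A-Z]{1,25}'`. $pattern is a very special motif found in our DNA and RNA sequences, which has recently been found to be important in regulating fundamental biological phenomena such as gene expression. – Neal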

Answers

2

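Rather than splitting the sequence into three parts, the way I see this working is to find all occurrences of $pattern in the complete sequence and determine which third each match starts in.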

The built-in variable $-[0] contains the offset of the start of the most recent successful match.
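As a minimal illustration of `$-[0]`, the sketch below collects the start offset of every match in a short made-up string. `match_starts` is a hypothetical helper name, not part of the answer's program.

```perl
use strict;
use warnings;

# Hypothetical helper: return the start offset of every match of $regex in $text.
sub match_starts {
    my ($text, $regex) = @_;
    my @starts;
    while ($text =~ /$regex/g) {
        push @starts, $-[0];    # offset at which this match began
    }
    return @starts;
}

my @starts = match_starts('aaGGGttGGGGc', qr/G{3,5}/);
print "@starts\n";    # prints "2 7"
```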

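The code below does what I think you want. It works by accumulating each sequence (which ends when a new sequence ID is found or the end of the file is reached) and passing it to the process_seq subroutine.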

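The subroutine takes the length of the sequence and calculates the offset of the end of each third of the string. The idiom sprintf '%.0f', $value is used to round a fractional value to the nearest character position.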
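The offset calculation can be tried on its own. In this sketch, `third_offsets` is a hypothetical wrapper around the same `map`/`sprintf` expression: for each of 1, 2, and 3 it computes that many thirds of the length and rounds the result.

```perl
use strict;
use warnings;

# Hypothetical helper: end offsets of the first, second, and third thirds
# of a string of the given length, rounded to the nearest whole position.
sub third_offsets {
    my ($length) = @_;
    return map { sprintf '%.0f', $length * $_ / 3 } 1 .. 3;
}

print join(' ', third_offsets(105)), "\n";    # prints "35 70 105"
print join(' ', third_offsets(100)), "\n";    # prints "33 67 100"
```

Because the fractional part of a multiple of one third is always 0, 1/3, or 2/3, the rounding never has to break an exact .5 tie.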

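The @counts array is updated for every occurrence of $regex in the sequence. The element of @counts to increment is established by comparing the start position of the match, held in $-[0], with the end offset of each of the three segments of the sequence.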
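The comparison can be sketched as a standalone function. `third_index` is a hypothetical name. It applies the same `next if $place >= ...` test to find the first segment whose end offset lies beyond the match start.

```perl
use strict;
use warnings;

# Hypothetical helper: given a match start position and the end offsets of
# the three thirds, return 0, 1, or 2 for beginning, middle, or end.
sub third_index {
    my ($place, @offsets) = @_;
    for my $i (0 .. 2) {
        next if $place >= $offsets[$i];    # match starts at or after the end of this third
        return $i;                         # first third whose end lies beyond $place
    }
    return;    # not reached for a position inside the sequence
}

my @offsets = (35, 70, 105);
print third_index(0,  @offsets), "\n";    # prints 0 (beginning)
print third_index(35, @offsets), "\n";    # prints 1 (middle)
print third_index(70, @offsets), "\n";    # prints 2 (end)
```

When the start position is smaller than the current end offset, `next` is skipped, so that segment is chosen and the search stops. Each match is therefore counted exactly once.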

Once each sequence has been processed, the values in @counts are accumulated into @totals to give the overall figures for all sequences.

The output of the program when run on the sample data is shown below. The totals are (9, 1, 6).

use strict; 
use warnings; 

my $gpat = '[G]{3,5}'; 
my $npat = '[A-Z]{1,25}'; 
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat; 
my $regex = qr/$pattern/i; 

open my $fh, '<', 'sequences.txt' or die $!; 

my ($id, $seq); 
my @totals = (0, 0, 0); 

while (<$fh>) { 

    chomp; 

    if (/^>(\w+)/) {
        process_seq($seq) if $id;
        $id = $1;
        $seq = '';
        print "$id\n";
    }
    elsif ($id) {
        $seq .= $_;
        process_seq($seq) if eof;
    }
} 

print "Total: @totals\n"; 



sub process_seq { 

    my $sequence = shift; 
    my $length = length $sequence; 

    my @offsets = map {sprintf '%.0f', $length * $_/3} 1..3; 

    my @counts = (0, 0, 0); 

    while ($sequence =~ /$regex/g) {
        my $place = $-[0];
        for my $i (0..2) {
            next if $place >= $offsets[$i];
            $counts[$i]++;
            last;
        }
    }

    print "@counts\n\n"; 
    $totals[$_] += $counts[$_] for 0..2; 
} 

Output

NR_037701 
0 0 1 

NM_198399 
1 0 0 

NR_026816 
1 0 1 

NR_027917 
0 0 0 

NR_002777 
0 0 0 

NR_033769 
1 0 0 

NM_016326 
1 0 1 

NM_181641 
1 0 1 

NM_001144931 
0 0 0 

NR_029429 
0 1 0 

NR_026551 
1 0 0 

NM_181640 
1 0 1 

NM_016951 
1 0 1 

NR_002773 
1 0 0 

NR_037806 
0 0 0 

Total: 9 1 6 

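Hello, sir! I am amazed at how you turned the task into such a smooth, logical algorithm. Your approach completely eliminates the problem of $pattern being split. I really wish I could program like you, sir! I do have a few doubts, which I will post as further comments shortly in case I cannot resolve them myself. – Neal

Borodin, could you help me understand this line of code: `my @offsets = map {sprintf '%.0f', $length * $_/3} 1..3;`. I understand that, as you mentioned, it is calculating the offset of the end of each third of the string, but how exactly does it work? – Neal

Also, I am a bit confused by the line `next if $place >= $offsets[$i];`. Say $place = 100 and $offsets[$i] = 1000; then we have the condition `$place < $offsets[$i]`, don't we? What happens in that case? Many thanks in advance! – Neal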

2

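I lifted Borodin's process_seq function, but used Bio::SeqIO to read the file in sequence by sequence, instead of reading it line by line by hand with custom logic to pick out the individual sequences. I believe the advantages are:

  • The code has already been developed and tested by many others.
  • If the output is also written through the Bio::SeqIO module, the resulting file can then be read back using Bio::SeqIO's read method (next_seq).
  • Other reasons I can't think of right now :-)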

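I imagine the BioPerl package of modules must be overwhelming to a biologist who is just starting to program. He may be reluctant to dig for the information he needs to start writing programs. The BioPerl wiki is a good place to start, especially the HOWTO section, which includes a HOWTO for beginners among others. You will find most (?) of the useful code examples there. Bio::Seq has some good code examples right at the beginning of its documentation and is where most of the routine sequence functions live. For input/output, use the Bio::SeqIO module, which has examples at the start of its manual.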

#!/usr/bin/perl 
use strict; 
use warnings; 
use Bio::SeqIO; 

my $gpat = '[G]{3,5}'; 
my $npat = '[A-Z]{1,25}'; 
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat; 
my $regex = qr/$pattern/i; 

my $in = Bio::SeqIO->new (-file => "fasta_dat.txt", 
          -format => 'fasta'); 
my @totals; 
while (my $seq = $in->next_seq()) { 
    process($seq); 
} 

print "Totals: "; 
print "@totals\n"; 

sub process { 
    my $seq = shift; 
    my @offset = map {sprintf '%.0f', $seq->length * $_/3} 1..3; 
    my $sequence = $seq->seq; 

    my @count = (0,0,0); 
    while ($sequence =~ /$regex/g) {
        my $place = $-[0];
        for my $i (0 .. 2) {
            next if $place >= $offset[$i];
            $count[$i]++;
            last;
        }
    }
    print $seq->id, "\n@count\n";
    $totals[$_] += $count[$_] for 0 .. $#count; 
} 

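Hello Chris, thank you very much for your answer! I am indeed completely new to programming and Perl. I am actually on a 2-month internship in bioinformatics and was pointed to "Beginning Perl for Bioinformatics". The scope of the book does not cover problems like this one. So yes, not only are many of the variables and functions new to me, the use of BioPerl modules is unfamiliar too. It makes me realize how amazing Perl and programming are, and how much I still have to learn... maybe 2 months is not enough. I am learning something new every day! – Neal

You can also search this site, or Perl Monks at [http://perlmonks.com/?node=Super%20Search]. Search for 'fasta' or 'Bio::SeqIO'. –

Many, many thanks, Chris! :) – Neal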