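I am trying to find protein motifs in an Ensembl FASTA file. I have most of the script working, such as retrieving the sequence IDs and the sequences themselves, but I am getting some odd results. In short: I cannot remove the newlines from the Ensembl FASTA data.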
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $motif1 = qr/(HE(\D)(\D)H(\D{18})E)/x;
my $motif2 = qr/(AMEN)/x;
my $input;
my $output;
my $count_total = 0;
my $count_processed = 0;
my $total_run = 0;
my $id;
my $seq;
my $motif1_count = 0;
my $motif2_count = 0;
my $motifboth_count = 0;
############################################################################################################################
# FILEHANDLING - INPUT/OUTPUT
# User input prompting and handling
print "**********************************************************\n";
print "Question 3\n";
print "**********************************************************\n";
# Open the input file, or die with the system error message.
# ("or" is used instead of "||", which would bind to the filename string.)
open my $fh, '<', "chr2.txt" or die "Error! Cannot open file: $!\n";
#Opens and creates output file previously assigned to variable to new variable or kills script
#open(RESULTS, '>', $output)||die "Error! Cannot create output file:$!\n";
# FILE and DATA PROCESSING
############################################################################################################################
while (<$fh>) {
    if (/^>(\S+)/) {
        $count_total++;                    # Add one to the protein count
        find_motifs($id, $seq) if $seq;    # Pass the previous record to the subroutine
        $id  = substr($1, 0, 15);          # Keep only the first 15 characters of the ID
        $seq = '';
    }
    else {
        chomp;
        $seq .= $_;
    }
}
print "Total proteins: $count_total \n";
print "Proteins with both motifs: $motifboth_count \n";
print "Proteins with motif 1: $motif1_count \n";
print "Proteins with motif 2: $motif2_count \n";
exit;
######################################################################################################################################
# SUBROUTINES
#
# Takes the ID and sequence passed in via @_
# Finds the position of motif within seq
# Checks for motif 1 presence and if found, checks for motif 2. If not found, prints motif 1 results
# If no motif 1, checks for motif 2
sub find_motifs {
    my ($id, $seq) = @_;
    if ($seq =~ $motif1) {
        my $motif_position = index $seq, $1;
        my $motif = $1;
        if ($seq =~ $motif2) {
            $motif1_count++;
            $motif2_count++;
            $motifboth_count++;
            print "$id, $motif_position, \n$motif \n";
        }
        else {
            $motif1_count++;
            print "$id, $motif_position,\n $motif\n\n";
        }
    }
    elsif ($seq =~ $motif2) {
        $motif2_count++;
    }
}
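What happens is that if a motif starts at the end of one data line and finishes at the beginning of the next, the script returns the motif with the line break still embedded in it. This way of munging the data has worked fine for me before.
Sample results: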
ENSG00000119013, 6, HEHGHHKMELPDYRQWKIEGTPLE (CORRECT!)
ENSG00000142327, 123, HEVAHSWFGNAVTNATWEEMWLSE (CORRECT!)
ENSG00000151694, 410, **AECAPNEFGAEHDPDGL**
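This is the problem. The motif matches, but the script returns the first half, then a line break, and then prints the second half on the same line as well. (This is a symptom of the bigger problem: getting rid of the newlines!)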
Total proteins: 13653
Proteins with both motifs: 1
Proteins with motif 1: 12
Proteins with motif 2: 22
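The overwritten line in the sample results would be consistent with a carriage return surviving inside the sequence, although the question does not confirm it. A minimal sketch for checking this, assuming the file uses Windows-style CRLF line endings (the sample line and the `has_cr` helper are hypothetical, for illustration only):

```perl
use strict;
use warnings;

# Hypothetical data line as it might look in a CRLF-terminated FASTA file.
my $line = "HEHGH\r\n";
chomp $line;    # with the default $/ ("\n") this removes only the trailing "\n"

# Returns 1 if the string still contains a carriage return, 0 otherwise.
sub has_cr {
    my ($text) = @_;
    return $text =~ /\r/ ? 1 : 0;
}

print "CR still present after chomp\n" if has_cr($line);

# \D means "any non-digit", so it also matches a carriage return.
print "\\D matches a carriage return\n" if "\r" =~ /\D/;
```

If both messages print for real lines from the file, the `\D` placeholders in `$motif1` can match across a line join, and printing the embedded `\r` moves the cursor back to the start of the line so the second half overwrites the first.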
I have tried different approaches, such as `@seq =~ s/\r//g`
or `s/\n//g`, at different places in the script.