Perl: parsing data and inserting lines

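I have a file like the one below to parse, and I want to turn it into ......

The columns are, from the first: ID, exon information, start position, end position, and strand. The ID increases by 1 each time a number appears in the second column.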

1 9239 712 8571 + 
1 start_codon 712 714 + 
1 stop_codon 8569 8571 + 
2 3882 24137 24264 + 
2 start_codon 24137 24139 + 
3 3882 24322 24391 + 
4 3882 24490 26064 + 
4 stop_codon 26062 26064 + 
5 4972 26704 26740 + 
5 start_codon 26704 26706 + 
6 4972 26814 27170 + 
7 4972 27257 27978 + 
7 stop_codon 27976 27978 + 
8 10048 40161 41114 - 
8 start_codon 41112 41114 - 
8 stop_codon 40161 40163 - 
9 272 43167 43629 - 
9 stop_codon 43167 43169 - 
10 272 43755 44059 - 
10 start_codon 44057 44059 -

into this....

1 9239 *712* *8571* + 
1 start_codon 712 714 + 
1 stop_codon 8569 8571 + 
*X 9239 712 8571 +* 
2 3882 *24137* 24264 + 
2 start_codon 24137 24139 + 
3 3882 24322 24391 + 
4 3882 24490 *26064* + 
4 stop_codon 26062 26064 + 
*X 3882 24137 26064 +*
5 4972 *26704* 26740 + 
5 start_codon 26704 26706 + 
6 4972 26814 27170 + 
7 4972 27257 *27978* + 
7 stop_codon 27976 27978 + 
*X 4972 26704 27978 +* 
8 10048 *40161* *41114* - 
8 start_codon 41112 41114 - 
8 stop_codon 40161 40163 - 
*X 10048 40161 41114 -* 
9 272 *43167* 43629 - 
9 stop_codon 43167 43169 - 
10 272 43755 *44059* - 
10 start_codon 44057 44059 - 
*X 272 43167 44059 -*

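The lines beginning with X have been added, which I cannot do with my technique...... :(

The thing is, for each exon number in the second column, ignoring 'start_codon' and 'stop_codon', I need to get the smallest exon position and the largest exon position, which are the values marked between asterisks (*).

This is my basic code to parse this data... but I think it has to be rewritten from scratch (I don't know how to insert the 'X' line).

(Sorry, I removed the code because it was not good enough and might cause confusion......)

Perl gurus of the world, can you help me?

Thanks!

As TLP asked, I have put my code back. It is awkward code, though.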

use strict;
use warnings;

if (@ARGV != 1) {
    print "Invalid arguments\n";
    print "Usage: perl min_max.pl [exon_output_file]\n";
    exit(1);
}

my $FILENAME = $ARGV[0];
my $exonid   = 0;
my $exon     = "";
my $startpos = 0;
my $endpos   = 0;
my $strand   = "";
my $min_pos  = 0;    # declared, but never updated below
my $max_pos  = 0;    # declared, but never updated below

open(my $fh, '<', $FILENAME) or die "Cannot open $FILENAME: $!";

while (my $line = <$fh>) {
    chomp $line;
    next if $line eq "";

    # pick apart the five tab-separated columns
    if ($line =~ /^(.+)\t(.+)\t(.+)\t(.+)\t(.+)/) {
        ($exonid, $exon, $startpos, $endpos, $strand) = ($1, $2, $3, $4, $5);
    }
    if ($exon =~ /\d+/) {
        # exon line (number in column 2)
        print $exonid, "\t", $exon, "\t", $startpos, "\t", $endpos, "\t", $strand, "\n";
    } else {
        # start_codon / stop_codon line; printed the same way for now
        print $exonid, "\t", $exon, "\t", $startpos, "\t", $endpos, "\t", $strand, "\n";
    }
}

close($fh);
exit;

How do I compare the values to get the maximum and the minimum....

Source

2012-12-26 Karyo

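You are not trying to store the min/max values anywhere. How could this possibly work? – TLP

Exactly. I am trying but I can't, so I just posted the most basic code as a starting point and removed all the parts that did not work. I am still working on it. That is why the code above is so simple. Thanks. :( – Karyo

Why is it '*X 10048 40161 41114 -*' and not '*X 10048 41112 41114 -*' after 8? –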

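Basically, what you do is go through the lines, skip the ones you don't want (i.e. those with no number in column 2), keep track of the min/max for each new line in the same group, and when the number in column 2 changes, print and start over. With this solution you also have to print the last group manually at the end.

This code uses the internal DATA filehandle to demonstrate with your data. Simply change <DATA> to <> to run it on a target input file, like this: perl script.pl inputfile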

use strict;
use warnings;
use feature 'say';
use List::Util qw(min max);

my $print;
my ($min, $max, $id);
while (<DATA>) {                            ###### change to <> to run on input file
    my @line = split;
    if ($line[1] !~ /^\d+$/) {              # if non-numbers in col 2
        print;                              # print line
        next;                               # skip to next line
    }
    if (!defined($id) or $id != $line[1]) { # New dataset!
        say $print if $print;               # Print and reset
        $id = $line[1];
        $min = $max = undef;
    }
    $min = min($min // (), @line[2,3]);     # find min/max, skip undef
    $max = max($max // (), @line[2,3]);
    $print = join "\t", "X", $line[1], $min, $max;  # buffer the print
}
say $print if defined $print;               # print the last group

__DATA__ 
1 9239 712 8571 + 
1 start_codon 712 714 + 
1 stop_codon 8569 8571 + 
2 3882 24137 24264 + 
2 start_codon 24137 24139 + 
3 3882 24322 24391 + 
4 3882 24490 26064 + 
4 stop_codon 26062 26064 + 
5 4972 26704 26740 + 
5 start_codon 26704 26706 + 
6 4972 26814 27170 + 
7 4972 27257 27978 + 
7 stop_codon 27976 27978 + 
8 10048 40161 41114 - 
8 start_codon 41112 41114 - 
8 stop_codon 40161 40163 - 
9 272 43167 43629 - 
9 stop_codon 43167 43169 - 
10 272 43755 44059 - 
10 start_codon 44057 44059 -

Output (summary values only: the column 2 number, the minimum, and the maximum of each group):

9239   712    8571
3882   24137  26064
4972   26704  27978
10048  40161  41114
272    43167  44059

Source

2012-12-26 13:09:56 TLP

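Thanks for the help, TLP, but the output I want is like the second box in my question. Still, reading your code will help me learn more Perl. So I am happy, and thank you for your help and time. – Karyo

@Karyo It is just a matter of adding a 'print' statement before the 'next' statement and then reformatting the '$print' variable. Unless you also want the '*' symbols around the min/max numbers. – TLP

Thanks TLP, yours works more generally! But what does adding a print statement before the next statement mean? Sorry, I don't understand :( – Karyo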
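
What the comment above describes could look roughly like the following. This is only a sketch under some assumptions: columns are separated by whitespace, every line should be echoed (re-joined with tabs), and the X line should carry the strand of the group's exon lines. The helper name summarize is hypothetical and not part of the answer.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: takes input lines and returns them with an "X"
# summary line added after each group of equal column-2 numbers.
sub summarize {
    my @out;
    my ($id, $min, $max, $strand);
    for my $line (@_) {
        my @f = split ' ', $line;
        next unless @f >= 5;                     # skip blank or short lines
        if ($f[1] =~ /^\d+$/) {                  # exon line: number in col 2
            if (defined $id and $id != $f[1]) {  # number changed: close group
                push @out, join "\t", 'X', $id, $min, $max, $strand;
                ($min, $max) = ();               # reset both to undef
            }
            ($id, $strand) = @f[1, 4];
            $min = min($min // (), @f[2, 3]);    # skip undef on first use
            $max = max($max // (), @f[2, 3]);
        }
        push @out, join "\t", @f;                # echo every line, codons too
    }
    # the last group has to be flushed by hand
    push @out, join "\t", 'X', $id, $min, $max, $strand if defined $id;
    return @out;
}

print "$_\n" for summarize(
    "8 10048 40161 41114 -",
    "8 start_codon 41112 41114 -",
    "8 stop_codon 40161 40163 -",
    "9 272 43167 43629 -",
    "9 stop_codon 43167 43169 -",
    "10 272 43755 44059 -",
    "10 start_codon 44057 44059 -",
);
```

Codon lines are echoed but do not contribute to the min/max, so each X line appears right before the first line of the next group, as in the second box of the question (without the asterisks).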

If I understand you correctly, here is one way (untested!) to do what you appear to be asking for:

use strict; 
use warnings; 
use feature 'say'; 

# read first line, initialize accumulators, print it back 
chomp($_ = <>); 
my ($last_id, $last_exon, $min_start, $max_end, $last_strand) = split /\t/; 
say $_; 

# loop over remaining lines 
while (<>) { 
    chomp; 
    my ($exonid, $exon, $startpos, $endpos, $strand) = split /\t/; 

    if ($exon !~ /\D/ and $exon != $last_exon) {
        # new exon found, print summary of last one...
        say join "\t", "X", $last_exon, $min_start, $max_end, $last_strand;
        # ...and reset accumulators
        ($last_id, $last_exon, $min_start, $max_end, $last_strand)
            = ($exonid, $exon, $startpos, $endpos, $strand);
    }
    else {
        # previous exon continues, just update accumulators
        $last_id     = $exonid;
        $last_exon   = $exon     if $exon !~ /\D/;
        $min_start   = $startpos if $min_start > $startpos;
        $max_end     = $endpos   if $max_end < $endpos;
        $last_strand = $strand;  # should not really be needed
    }
    # ...and don't forget to print the original line back again
    say $_;
}
# end of file, print summary of last exon
print join("\t", "X", $last_exon, $min_start, $max_end, $last_strand), "\n";

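Basically, I am assuming that you want to print a summary line starting with X whenever you encounter a number in the second column that differs from the previous number in that column, while lines with a non-numeric value in the second column should not trigger a summary. Also, you presumably want a summary line at the end of the file as well.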

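The expression $exon !~ /\D/ returns true if $exon contains only digits. (Specifically, it tests that the value does not contain any non-digit character, so an empty string would pass as well.)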
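
To see the difference on a few values, here is a small sketch comparing that test with a stricter anchored pattern. The helper names no_non_digit and all_digits are made up for this illustration.

```perl
use strict;
use warnings;

# Loose test from the answer: true when the value has no non-digit character.
sub no_non_digit { my ($v) = @_; return $v !~ /\D/ ? 1 : 0 }

# Stricter test: at least one digit and nothing else.
sub all_digits { my ($v) = @_; return $v =~ /\A\d+\z/ ? 1 : 0 }

for my $value ('3882', 'start_codon', '12a', '') {
    printf "%-15s loose=%d strict=%d\n",
        "'$value'", no_non_digit($value), all_digits($value);
}
```

Only the empty string is treated differently: it passes the loose test but fails the strict one.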

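There are a bunch of edge cases I have not considered, since I do not know whether they can occur in your data or how they should be handled if they do. For example, just to be careful, one might want to print a summary in the unlikely event that the strand changes while the exon number stays the same. Likewise, a careful programmer might want to consider the possibility that the input file is empty, or that the first line has a non-numeric value in the second column.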
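
One possible way to cover those cases is sketched below. The choices in it are assumptions rather than requirements from the question: a strand change closes the current group, lines seen before the first exon line are echoed but not counted, and empty input produces no output at all. The name add_summaries is hypothetical, and the input is assumed to be tab-separated as in the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the input lines with "X" summary lines added.
# A group ends when the column-2 number OR the strand changes.
sub add_summaries {
    my @out;
    my ($exon, $strand, $min, $max);        # all undef until first exon line
    my $flush = sub {
        push @out, join "\t", 'X', $exon, $min, $max, $strand
            if defined $exon;               # nothing to print before that
    };
    for my $line (@_) {
        chomp(my $copy = $line);
        my ($id, $col2, $start, $end, $dir) = split /\t/, $copy;
        if (defined $dir and $col2 =~ /\A\d+\z/) {
            if (!defined $exon or $col2 != $exon or $dir ne $strand) {
                $flush->();                 # close the previous group, if any
                ($exon, $strand, $min, $max) = ($col2, $dir, $start, $end);
            }
            for my $pos ($start, $end) {    # widen the range as needed
                $min = $pos if $pos < $min;
                $max = $pos if $pos > $max;
            }
        }
        push @out, $copy;                   # echo the original line
    }
    $flush->();                             # summary for the last group
    return @out;
}

print "$_\n" for add_summaries(
    "1\tstart_codon\t5\t7\t+",              # codon line before any exon line
    "1\t100\t5\t50\t+",
    "2\t100\t60\t90\t-",                    # same number, different strand
);
```

Unlike the code above, this version takes the min/max from the exon lines only, so codon lines never affect the summary.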

At least, with use warnings, you will be alerted if I have assumed some value to always be numeric when it is not.

Source

2012-12-26 12:34:10

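Thanks llmari Karonen! This code works well, but it does not print the first line of the file, "1 9239 712 8571 +", and starts from the second line. But I will try to work out the rest myself! – Karyo

Ah, silly me. Fixed. –

I tried it, but this time the first line is printed twice. So once I removed the 'say $_;' from the first chomp section, it worked perfectly! Thanks heaps! – Karyo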
