2013-12-18 30 views
2

Getting elements of chunked data using Perl

I have data that looks like this:

some info 
some info 

[Term] 
id: GO:0000001 
name: mitochondrion inheritance 
namespace: biological_process 
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cy 
synonym: "mitochondrial inheritance" EXACT [] 
is_a: GO:0048308 ! organelle inheritance 
is_a: GO:0048311 ! mitochondrion distribution 

[Term] 
id: GO:0000002 
name: mitochondrial genome maintenance 
namespace: biological_process 
def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw] 
is_a: GO:0007005 ! mitochondrion organization 

[Typedef] 
id: regulates 
name: regulates 
xref: RO:0002211 
transitive_over: part_of ! part_of 

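Note that the end of the file contains whitespace.

What I want to do is parse each chunk that starts with [Term] and get its id, name and namespace. At the end of the day I want a hash of arrays like this: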

$VAR = {'GO:0000001' => ["mitochondrion inheritance","biological_process"], 
     'GO:0000002' => ["mitochondrial genome maintenance","biological_process"]}; 

How can I do this in Perl?

I'm stuck with this code:

#!/usr/bin/perl 
use Data::Dumper; 
my %bighash; 
while(<DATA>) { 
    chomp; 
    my $line = $_; 

    my $term = ""; 
    my $id = ""; 
    my $name =""; 
    my $namespace =""; 
    if ($line =~ /^\[Term/) { 
    $term = $line; 
    } 
    elsif ($line =~ /^id: (.*)/) { 
    $id = $1; 
    } 
    elsif ($line =~ /^name: (.*)/) { 
    $name = $1; 
    } 
    elsif ($line =~ /^namespace: (.*)/) { 
    $namespace = $1; 
    } 
    elsif ($line =~ /$/) { 
    $bighash{$id}{$name} = $namespace; 
    } 

} 

print Dumper \%bighash; 



__DATA__ 
some info 
some info 

[Term] 
id: GO:0000001 
name: mitochondrion inheritance 
namespace: biological_process 
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cy 
synonym: "mitochondrial inheritance" EXACT [] 
is_a: GO:0048308 ! organelle inheritance 
is_a: GO:0048311 ! mitochondrion distribution 

[Term] 
id: GO:0000002 
name: mitochondrial genome maintenance 
namespace: biological_process 
def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw] 
is_a: GO:0007005 ! mitochondrion organization 

[Typedef] 
id: regulates 
name: regulates 
xref: RO:0002211 
transitive_over: part_of ! part_of 

You can test it here: https://eval.in/80497

Answers

5

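If you set Perl's input record separator to the empty string (local $/ = '';), the data is read in paragraph mode, i.e. in chunks separated by blank lines. You can then use regexes to capture the parts you need from each chunk. For example: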

use strict; 
use warnings; 
use Data::Dumper; 

local $/ = ''; 
my %hash; 

while (<DATA>) { 
    next unless /^\[Term\]/; 

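    # Anchor each pattern to the start of a line (/m) so that text inside 
    # other fields, such as def, cannot match by accident. 
    my ($id)        = /^id:\s+(.+)/m; 
    my ($name)      = /^name:\s+(.+)/m; 
    my ($namespace) = /^namespace:\s+(.+)/m; 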

    push @{ $hash{$id} }, ($name, $namespace); 
} 

print Dumper \%hash; 

__DATA__ 
[Term] 
id: GO:0000001 
name: mitochondrion inheritance 
namespace: biological_process 
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cy 
synonym: "mitochondrial inheritance" EXACT [] 
is_a: GO:0048308 ! organelle inheritance 
is_a: GO:0048311 ! mitochondrion distribution 

[Term] 
id: GO:0000002 
name: mitochondrial genome maintenance 
namespace: biological_process 
def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw] 
is_a: GO:0007005 ! mitochondrion organization 

[Typedef] 
id: regulates 
name: regulates 
xref: RO:0002211 
transitive_over: part_of ! part_of 

Output:

$VAR1 = { 
      'GO:0000001' => [ 
          'mitochondrion inheritance', 
          'biological_process' 
          ], 
      'GO:0000002' => [ 
          'mitochondrial genome maintenance', 
          'biological_process' 
          ] 
     }; 

Hope this helps!

3

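Here is a nice trick that may help. Perl has a $/ variable that defines the "input record separator": when you read an input record with <DATA>, Perl reads until it hits whatever $/ is set to, then returns everything it has read so far.

Normally $/ is set to a newline, so <DATA> returns one line of the file at a time. But if you set it to the empty string "", each read returns all of the data up to the next blank line or run of blank lines: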

$/ = ""; 
while (<DATA>) { 
    chomp;  # remove the trailing newlines 
    # $_ now contains a whole blank-line-separated chunk 
    if (/^\[Term\]/) { 
     ... 
     # parse the [Term] chunk here 
     ... 
    } 
} 

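Inside the loop, you can split the chunk into lines, then split each line on the ":" string to get a key and a value. At that point you can put the chunk's keys and values into whatever kind of structure you like.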
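
The split-based approach described above could be sketched roughly as follows. This is only an illustration under a few assumptions: parse_terms is a hypothetical helper name (not from the answer), the input is passed in as a single string and read through an in-memory filehandle, and when a key such as is_a repeats within a chunk only the first value is kept.

```perl
use strict;
use warnings;
use Data::Dumper;

# parse_terms is a hypothetical helper name, not part of the original answer.
# It takes the whole input as one string and returns a hash reference of
# the form { id => [ name, namespace ] }.
sub parse_terms {
    my ($text) = @_;
    my %terms;

    # Read the string through an in-memory filehandle.
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

    # Paragraph mode; local restores the old value when the sub returns.
    local $/ = '';

    while (my $chunk = <$fh>) {
        chomp $chunk;                        # drop the trailing newlines
        my ($header, @lines) = split /\n/, $chunk;
        next unless defined $header && $header =~ /^\[Term\]/;

        my %field;
        for my $line (@lines) {
            # A limit of 2 keeps values that contain ':' (GO:0000001) whole.
            my ($key, $value) = split /:\s*/, $line, 2;
            next unless defined $value;
            $value =~ s/\s+$//;              # trim trailing whitespace
            # Assumption: keep the first value when a key repeats (is_a).
            $field{$key} //= $value;
        }

        next unless defined $field{id};
        $terms{ $field{id} } = [ $field{name}, $field{namespace} ];
    }
    close $fh;
    return \%terms;
}

my $sample = <<'END';
some info

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
is_a: GO:0048308 ! organelle inheritance

[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process

[Typedef]
id: regulates
name: regulates
END

print Dumper(parse_terms($sample));
```

One thing to watch for: paragraph mode only treats truly empty lines as separators, so a "blank" line that contains spaces or tabs does not end a chunk. If the real file has such lines, they would need to be cleaned up first.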