2014-07-11 117 views
-2

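I have to parse several XML files with Perl and store the values in a hash. If possible, I would like to filter on certain attributes. Later in my code I pull the data out of the hash and insert it into a database. What is the best way to parse complex XML with Perl?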

I have been using XML::Parser, but I would prefer to parse into a hash rather than handle every tag as it is encountered. Any suggestions?

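I want to skip any path that has the attribute kind="dir". I need the author, date, msg, and the file type (file extension) of each path. There can be any number of <path> tags, and each can be of kind "file" or "dir". There can also be multiple <logentry> tags.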

<?xml version="1.0" encoding="UTF-8"?> 
<log> 
    <logentry revision="3989"> 
     <author>cergyl</author> 
     <date>2013-07-19T05:31:01.212620Z</date> 
     <paths> 
      <path action="M" kind="dir">/team.admin/trunk/auth.conf</path> 
     </paths> 
     <path action="M" kind="file">/team.admin/trunk/file.cpp</path> 
     <msg>Whitespace change to verify repository synchronization</msg> 
    </logentry> 
</log> 

my $XML_Parser = XML::Parser->new(
            Handlers => { 
               Start => \&hdl_xml_tag_start, 
               End  => \&hdl_xml_tag_end, 
               Char => \&hdl_xml_nonmarkup_char, 
               Default => \&hdl_xml_default 
               } 
           ); 

# This event is generated when an XML start tag is recognized. Parser is an XML::Parser::Expat instance. 
sub hdl_xml_tag_start 
{ 
    my ($parser, $element, %attributes) = @_; 
    $attributes{ '_str' } = "$element:"; 
    $XML_Attributes_Hash_Ref = \%attributes; 
    return; 
} 

# This event is generated when an XML end tag is recognized. Note that an XML empty tag (<foo/>) generates both a start and an end event. 
sub hdl_xml_tag_end 
{ 
    my ($parser, $element) = @_; 

    #format_message($XML_Attributes_Hash_Ref); 
    format_svn_history($XML_Attributes_Hash_Ref); 
    return; 
} 


# This event is generated when non-markup is recognized. The non-markup sequence of characters is in String. 
# A single non-markup sequence of characters may generate multiple calls to this handler. 
sub hdl_xml_nonmarkup_char 
{ 
    my ($parser, $string) = @_; 
    $XML_Attributes_Hash_Ref->{ '_str' } .= $string; 
    return; 
} 

#This is called for any characters that don't have a registered handler. 
sub hdl_xml_default { return; } 
+1

Why isn't XML::Parser working for you? – friedo

+2

I really like XML::Twig, not least because it lets me "purge" parsed content to free memory. – Sobrique

+0

@friedo, I edited my question. It works, but I would rather get the whole thing as a hash in one go. – Busch

Answers

2

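With the limited information you have provided it is hard to write a comprehensive solution, but here is something that uses XML::Twig to process the XML data you show and print all (one) path elements whose kind attribute is not equal to dir.

XML::LibXML is another option; it is built on libxml2, which is written in C.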

use strict; 
use warnings; 

use XML::Twig; 

my $parser = XML::Twig->new(
    twig_handlers => { 
    path => \&path_handler, 
    } 
); 

$parser->parse(*DATA); 

sub path_handler { 
    my ($twig, $path) = @_; 
    # Skip directories; the // '' guard (Perl 5.10+) avoids a warning if kind is missing
    return if ($path->att('kind') // '') eq 'dir';
    print $path->text, "\n"; 
} 


__DATA__ 
<?xml version="1.0" encoding="UTF-8"?> 
<log> 
    <logentry revision="3989"> 
     <author>cergyl</author> 
     <date>2013-07-19T05:31:01.212620Z</date> 
     <paths> 
      <path action="M" kind="dir">/team.admin/trunk/auth.conf</path> 
     </paths> 
     <path action="M" kind="file">/team.admin/trunk/file.cpp</path> 
     <msg>Whitespace change to verify repository synchronization</msg> 
    </logentry> 
</log> 

Output

/team.admin/trunk/file.cpp 
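
Since the question asks for a hash rather than printed paths, the same approach could be extended along the following lines. This is only a sketch under some assumptions: the hash name %log, its layout (revision number as key), the handler name logentry_handler, and the rule that the extension is whatever follows the last dot of the final path component are all choices made for illustration, not something the question specifies. It uses the // operator, so it needs Perl 5.10 or later.

```perl
use strict;
use warnings;

use XML::Twig;

# Sample input copied from the question, held in a string for this sketch.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<log>
    <logentry revision="3989">
        <author>cergyl</author>
        <date>2013-07-19T05:31:01.212620Z</date>
        <paths>
            <path action="M" kind="dir">/team.admin/trunk/auth.conf</path>
        </paths>
        <path action="M" kind="file">/team.admin/trunk/file.cpp</path>
        <msg>Whitespace change to verify repository synchronization</msg>
    </logentry>
</log>
XML

# Hypothetical layout: revision => { author, date, msg, types => [extensions] }
my %log;

my $twig = XML::Twig->new(
    twig_handlers => {
        logentry => \&logentry_handler,
    },
);

$twig->parse($xml);

# Called once for each complete <logentry> element.
sub logentry_handler {
    my ($t, $entry) = @_;

    my %types;
    # descendants() finds <path> elements at any depth below <logentry>
    for my $path ($entry->descendants('path')) {
        next if ($path->att('kind') // '') eq 'dir';    # skip directories
        # Assumed rule: the extension is the text after the last dot
        my ($ext) = $path->text =~ m{\.([^./]+)\z};
        $types{$ext} = 1 if defined $ext;
    }

    $log{ $entry->att('revision') } = {
        author => $entry->first_child_text('author'),
        date   => $entry->first_child_text('date'),
        msg    => $entry->first_child_text('msg'),
        types  => [ sort keys %types ],
    };

    $t->purge;    # free the memory held by entries already processed
}

# Show what was collected, one line per revision.
for my $rev (sort { $a <=> $b } keys %log) {
    print join(' | ', $rev, @{ $log{$rev} }{qw(author date msg)},
        join(',', @{ $log{$rev}{types} })), "\n";
}
```

Handling whole logentry elements, rather than individual path elements, keeps the author, date, and msg together with the paths they belong to, and the call to purge keeps memory use flat when the log contains many entries.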
0

Personally, I like XML::DOM::Parser from XML::DOM, which I find to be a very high-quality module. But I use XML::Twig to pretty-print documents.

use XML::DOM;
use XML::Twig;

my $xp = XML::DOM::Parser->new();

# Parse from a string
my $doc = $xp->parse("<xml></xml>");
$doc->dispose();

# Parse from a file
$doc = $xp->parsefile("file.xml");
$doc->dispose();

# Pretty-print my poorly formatted XML doc
my $xpp = XML::Twig->new(pretty_print => 'indented');
$xpp->parse("<xml></xml>");
$xpp->print();
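
For the filtering the question asks about, a DOM-based version might look roughly like the sketch below. It is an illustration rather than part of the original answer: the array name @files is made up, the input is a trimmed copy of the question's XML, and it assumes every path element contains a single text node.

```perl
use strict;
use warnings;

use XML::DOM;

# Trimmed copy of the question's sample XML.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<log>
    <logentry revision="3989">
        <paths>
            <path action="M" kind="dir">/team.admin/trunk/auth.conf</path>
        </paths>
        <path action="M" kind="file">/team.admin/trunk/file.cpp</path>
    </logentry>
</log>
XML

my $parser = XML::DOM::Parser->new();
my $doc    = $parser->parse($xml);

my @files;    # hypothetical result list: paths that are not directories

# getElementsByTagName searches the whole document by default
my $nodes = $doc->getElementsByTagName('path');
for my $i (0 .. $nodes->getLength - 1) {
    my $path = $nodes->item($i);
    next if $path->getAttribute('kind') eq 'dir';    # skip directories
    # Assumes the element's first child is its text node
    push @files, $path->getFirstChild->getData;
}

$doc->dispose();    # break circular references so the tree can be freed

print "$_\n" for @files;
```

Unlike the handler-based style, this loads the entire document into memory before any filtering happens, which is convenient for small logs but may matter for very large ones.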
