2012-05-31 94 views
3

Parsing PubMed XML to submit to a MySQL database (XML::Twig)

I'm new to XML::Twig, and I'm trying to parse the PubMed XML 2.0 esummary final to put it into a MySQL database. This is what I have so far:

#!/bin/perl -w 
use strict; 
use DBI; 
use XML::Twig; 

my $uid = ""; 
my $title = ""; 
my $sortpubdate = ""; 
my $sortfirstauthor = ""; 
my $dbh = DBI->connect ("DBI:mysql:medline:localhost:80", 
          "root", "mysql"); 
my $t= new XML::Twig( twig_roots => { 'DocumentSummary' => $uid => \&submit }, 
         twig_handlers => { 'DocumentSummary/Title' => $title, 'DocumentSummary/SortPubDate' => $sortpubdate, 'DocumentSummary/SortFirstAuthor' => $sortfirstauthor}); 
$t->parsefile('20112.xml'); 
$dbh->disconnect(); 
exit; 

sub submit 
    { my $insert= $dbh->prepare("INSERT INTO medline_citation (uid, title, sortpubdate, sortfirstauthor) VALUES (?, ?, ?, ?);"); 
     $insert->bind_param(1, $uid); 
     $insert->bind_param(2, $title); 
     $insert->bind_param(3, $sortpubdate); 
     $insert->bind_param(4, $sortfirstauthor); 
     $insert->execute(); 
     $t->purge; 
    } 

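But Perl seems to stall for some reason. Am I doing this right? I'm trying to use twig_roots to reduce the amount of parsing, since I'm only interested in a few fields (these are large files).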

Here is a sample of the XML:

<DocumentSummary uid="22641317"> 
    <PubDate>2012 Jun 1</PubDate> 
    <EPubDate></EPubDate> 
    <Source>Clin J Oncol Nurs</Source> 
    <Authors> 
     <Author> 
      <Name>Park SH</Name> 
      <AuthType> 
       Author 
      </AuthType> 
      <ClusterID>0</ClusterID> 
     </Author> 
     <Author> 
      <Name>Knobf MT</Name> 
      <AuthType> 
       Author 
      </AuthType> 
      <ClusterID>0</ClusterID> 
     </Author> 
     <Author> 
      <Name>Sutton KM</Name> 
      <AuthType> 
       Author 
      </AuthType> 
      <ClusterID>0</ClusterID> 
     </Author> 
    </Authors> 
    <LastAuthor>Sutton KM</LastAuthor> 
    <Title>Etiology, assessment, and management of aromatase inhibitor-related musculoskeletal symptoms.</Title> 
    <SortTitle>etiology assessment and management of aromatase inhibitor related musculoskeletal symptoms </SortTitle> 
    <Volume>16</Volume> 
    <Issue>3</Issue> 
    <Pages>260-6</Pages> 
    <Lang> 
     <string>eng</string> 
    </Lang> 
    <NlmUniqueID>9705336</NlmUniqueID> 
    <ISSN>1092-1095</ISSN> 
    <ESSN>1538-067X</ESSN> 
    <PubType> 
     <flag>Journal Article</flag> 
    </PubType> 
    <RecordStatus> 
     PubMed - in process 
    </RecordStatus> 
    <PubStatus>4</PubStatus> 
    <ArticleIds> 
     <ArticleId> 
      <IdType>pii</IdType> 
      <IdTypeN>4</IdTypeN> 
      <Value>N1750TW804546361</Value> 
     </ArticleId> 
     <ArticleId> 
      <IdType>doi</IdType> 
      <IdTypeN>3</IdTypeN> 
      <Value>10.1188/12.CJON.260-266</Value> 
     </ArticleId> 
     <ArticleId> 
      <IdType>pubmed</IdType> 
      <IdTypeN>1</IdTypeN> 
      <Value>22641317</Value> 
     </ArticleId> 
     <ArticleId> 
      <IdType>rid</IdType> 
      <IdTypeN>8</IdTypeN> 
      <Value>22641317</Value> 
     </ArticleId> 
     <ArticleId> 
      <IdType>eid</IdType> 
      <IdTypeN>8</IdTypeN> 
      <Value>22641317</Value> 
     </ArticleId> 
    </ArticleIds> 
    <History> 
     <PubMedPubDate> 
      <PubStatus>entrez</PubStatus> 
      <Date>2012/05/30 06:00</Date> 
     </PubMedPubDate> 
     <PubMedPubDate> 
      <PubStatus>pubmed</PubStatus> 
      <Date>2012/05/30 06:00</Date> 
     </PubMedPubDate> 
     <PubMedPubDate> 
      <PubStatus>medline</PubStatus> 
      <Date>2012/05/30 06:00</Date> 
     </PubMedPubDate> 
    </History> 
    <References> 
    </References> 
    <Attributes> 
     <flag>Has Abstract</flag> 
    </Attributes> 
    <PmcRefCount>0</PmcRefCount> 
    <FullJournalName>Clinical journal of oncology nursing</FullJournalName> 
    <ELocationID></ELocationID> 
    <ViewCount>0</ViewCount> 
    <DocType>citation</DocType> 
    <SrcContribList> 
    </SrcContribList> 
    <BookTitle></BookTitle> 
    <Medium></Medium> 
    <Edition></Edition> 
    <PublisherLocation></PublisherLocation> 
    <PublisherName></PublisherName> 
    <SrcDate></SrcDate> 
    <ReportNumber></ReportNumber> 
    <AvailableFromURL></AvailableFromURL> 
    <LocationLabel></LocationLabel> 
    <DocContribList> 
    </DocContribList> 
    <DocDate></DocDate> 
    <BookName></BookName> 
    <Chapter></Chapter> 
    <SortPubDate>2012/06/01 00:00</SortPubDate> 
    <SortFirstAuthor>Park SH</SortFirstAuthor> 
</DocumentSummary> 

Thanks!

Answers

0

The syntax of your handlers is wrong. Each handler must be a code reference, not a scalar. See the example in the documentation:

my $twig=XML::Twig->new( 
    twig_handlers => 
     { title => sub { $_->set_tag('h2') }, # change title tags to h2 
     para => sub { $_->set_tag('p') }, # change para to p 
     hidden => sub { $_->delete;  }, # remove hidden elements 
     list => \&my_list_process,   # process list elements 
     div  => sub { $_[0]->flush;  }, # output and free memory 
     }, 
    pretty_print => 'indented',    # output will be nicely formatted 
    empty_tags => 'html',     # outputs <empty_tag /> 
         ); 
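
Applied to the fields from the question, a handler map might look like the sketch below. Each key is an element path and each value is a code reference. The `%record` and `@records` variables, the `<root>` wrapper element and the inline sample XML are assumptions made for illustration; they are not part of the original script.

```perl
use strict;
use warnings;
use XML::Twig;

my %record;     # hypothetical holder for the fields of the current record
my @records;    # collected records, kept only for this illustration

my $twig = XML::Twig->new(
    twig_handlers => {
        # Handlers are called with ($twig, $element); $_ is also set to the element.
        'DocumentSummary/Title'           => sub { $record{title}           = $_->text },
        'DocumentSummary/SortPubDate'     => sub { $record{sortpubdate}     = $_->text },
        'DocumentSummary/SortFirstAuthor' => sub { $record{sortfirstauthor} = $_->text },

        # Called when </DocumentSummary> is reached, after the child handlers above.
        'DocumentSummary' => sub {
            my ($t, $summary) = @_;
            $record{uid} = $summary->att('uid');
            push @records, {%record};    # store a copy of the finished record
            %record = ();
            $t->purge;                   # free the memory used by this record
        },
    },
);

# Assumption: <root> stands in for whatever element wraps the summaries.
$twig->parse(<<'XML');
<root>
  <DocumentSummary uid="22641317">
    <Title>Etiology, assessment, and management of aromatase inhibitor-related musculoskeletal symptoms.</Title>
    <SortPubDate>2012/06/01 00:00</SortPubDate>
    <SortFirstAuthor>Park SH</SortFirstAuthor>
  </DocumentSummary>
</root>
XML

printf "%s | %s | %s\n", @{ $records[0] }{qw(uid sortpubdate sortfirstauthor)};
```

In a real script the `DocumentSummary` handler would run the INSERT instead of pushing onto `@records`.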
+0

Still having some problems, even after simplifying the code: `my $dbh = DBI->connect("DBI:mysql:medline:localhost:80", "root", "mysql"); my $t = new XML::Twig(twig_roots => {'Title' => \&process, 'SortPubDate' => \&process, 'SortFirstAuthor' => \&process}); $t->parsefile('20112.xml'); $dbh->disconnect(); exit; sub process { my ($t, $elt) = @_; my $column = $elt->text; my $value = $elt->{'att'}; my $insert = $dbh->prepare("INSERT INTO medline_citation $column VALUES $value;"); $insert->execute(); $t->flush; }` – user1428925

1

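The way I would do this is to have a single handler, for DocumentSummary, that feeds the DB and then purges the record. There's no need to get any fancier than that.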

Also, I find DBIx::Simple, well, simpler to use than raw DBI. It takes care of preparing and caching the statements for me:

#!/bin/perl 

use strict; 
use warnings; 

use DBIx::Simple; 
use XML::Twig; 

my $db = DBIx::Simple->connect ("dbi:SQLite:dbname=t.db"); # replace by your DSN 

my $t= XML::Twig->new( twig_roots => { DocumentSummary => \&submit },) 
       ->parsefile('20112.xml'); 

$db->disconnect(); 
exit; 

sub submit 
    { my($t, $summary)= @_; 
     my $insert= $db->query("INSERT INTO medline_citation (uid, title, sortpubdate, sortfirstauthor) VALUES (?, ?, ?, ?);", 
           $summary->att('uid'), 
           map { $summary->field($_) } (qw(Title SortPubDate SortFirstAuthor)) 
          ); 
     $t->purge; 
    } 

In case you're wondering about map { $summary->field($_) } (qw(Title SortPubDate SortFirstAuthor)), it's just a fancier (and IMHO more maintainable) way of writing $summary->field('Title'), $summary->field('SortPubDate'), $summary->field('SortFirstAuthor')
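
For comparison, here is a rough sketch of the same single-handler approach using plain DBI, with the statement prepared once outside the handler and reused for every record. The in-memory SQLite database, the `CREATE TABLE` statement, the `<root>` wrapper element and the shortened sample XML are all assumptions made so the example is self-contained; swap in your own DSN and file.

```perl
use strict;
use warnings;
use DBI;
use XML::Twig;

# Assumption: an in-memory SQLite database stands in for the real DSN.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE medline_citation
          (uid TEXT, title TEXT, sortpubdate TEXT, sortfirstauthor TEXT)');

# Prepare once, outside the handler, and reuse the handle for every record.
my $insert = $dbh->prepare(
    'INSERT INTO medline_citation (uid, title, sortpubdate, sortfirstauthor)
     VALUES (?, ?, ?, ?)'
);

my $twig = XML::Twig->new( twig_roots => { DocumentSummary => \&submit } );

# Assumption: <root> stands in for whatever element wraps the summaries.
$twig->parse(<<'XML');
<root>
  <DocumentSummary uid="22641317">
    <Title>Etiology, assessment, and management of aromatase inhibitor-related musculoskeletal symptoms.</Title>
    <SortPubDate>2012/06/01 00:00</SortPubDate>
    <SortFirstAuthor>Park SH</SortFirstAuthor>
  </DocumentSummary>
</root>
XML

my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM medline_citation');
print "rows inserted: $count\n";
$dbh->disconnect;

sub submit {
    my ($t, $summary) = @_;
    # The uid is an attribute; the other three values are child element texts.
    $insert->execute(
        $summary->att('uid'),
        map { $summary->field($_) } qw(Title SortPubDate SortFirstAuthor),
    );
    $t->purge;    # release the memory used by the processed record
}
```

Preparing inside the handler, as in the question, also works, but it repeats the prepare step for every record in a large file.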