2014-11-21 125 views
0

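Perl - Parsing a tagged text file into new text files

I was given a .txt file of data that I need to format into something that can be uploaded to a database. The text is anchored, in any case. Depending on the tag, the data needs to be dumped into a specific txt file and be tab-delimited. I have done very little Perl in my life, but I know Perl can handle this type of application with ease; I am just lost as to where to begin. Outside of Java, SQL, and R, I am useless. Here is an example of one entry (I have close to 1000 of these to process):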

<PaperTitle>True incidence of all complications following immediate and delayed breast reconstruction.</PaperTitle> 
<Abstract>BACKGROUND: Improved self-image and psychological well-being after breast reconstruction are well documented. To determine methods that optimized results with minimal morbidity, the authors examined their results and complications based on reconstruction method and timing. METHODS: The authors reviewed all breast reconstructions after mastectomy for breast cancer performed under the supervision of a single surgeon over a 6-year period at a tertiary referral center. Reconstruction method and timing, patient characteristics, and complication rates were reviewed. RESULTS: Reconstruction was performed on 240 consecutive women (94 bilateral and 146 unilateral; 334 total reconstructions). Reconstruction timing was evenly split between immediate (n = 167) and delayed (n = 167). Autologous tissue (n = 192) was more common than tissue expander/implant reconstruction (n = 142), and the free deep inferior epigastric perforator was the most common free flap (n = 124). The authors found no difference in the complication incidence with autologous reconstruction, whether performed immediately or delayed. However, there was a significantly higher complication rate following immediate placement of a tissue expander when compared with delayed reconstruction (p = 0.008). Capsular contracture was a significantly more common late complication following immediate (40.4 percent) versus delayed (17.0 percent) reconstruction (p &lt; 0.001; odds ratio, 5.2; 95 percent confidence interval, 2.3 to 11.6). CONCLUSIONS: Autologous reconstruction can be performed immediately or delayed, with optimal aesthetic outcome and low flap loss risk. However, the overall complication and capsular contracture incidence following immediate tissue expander/implant reconstruction was much higher than when performed delayed. Thus, tissue expander placement at the time of mastectomy may not necessarily save the patient an extra operation and may compromise the final aesthetic outcome.</Abstract> 
<BookTitle>Book1</BookTitle> 
<Publisher>Publisher01, Boston</Publisher> 
<Edition>1st</Edition> 
<EditorList> 
    <Editor> 
     <LastName>Lewis</LastName> 
     <ForeName>Philip M</ForeName> 
     <Initials>PM</Initials> 
    </Editor> 
    <Editor> 
     <LastName>Kiffer</LastName> 
     <ForeName>Michael</ForeName> 
     <Initials>M</Initials> 
    </Editor> 
</EditorList> 
<Page>19-28</Page> 
<Year>2008</Year> 
<AuthorList> 
       <Author ValidYN="Y"> 
        <LastName>Sullivan</LastName> 
        <ForeName>Stephen R</ForeName> 
        <Initials>SR</Initials> 
       </Author> 
       <Author ValidYN="Y"> 
        <LastName>Fletcher</LastName> 
        <ForeName>Derek R D</ForeName> 
        <Initials>DR</Initials> 
       </Author> 
       <Author ValidYN="Y"> 
        <LastName>Isom</LastName> 
        <ForeName>Casey D</ForeName> 
        <Initials>CD</Initials> 
       </Author> 
       <Author ValidYN="Y"> 
        <LastName>Isik</LastName> 
        <ForeName>F Frank</ForeName> 
        <Initials>FF</Initials> 
       </Author> 
</AuthorList> 
// 

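PaperTitle, Abstract, and Page need to go into the Papers.txt file.

PaperTitle, BookTitle, Edition, Publisher, and Year need to go into the Book.txt file.

PaperTitle and all of the editor data (LastName, ForeName, Initials) need to go into Editors.txt.

PaperTitle and all of the author data (LastName, ForeName, Initials) need to go into Authors.txt.

// marks the end of an entry. All files need to be tab-delimited. While I wouldn't turn down finished code, I am hoping for at least some ideas that point me in the right direction for code that parses out one of the files (such as Book.txt); I can most likely figure it out from there. Thanks so much!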

+0

I would start by looking at the Config::General module to handle the parsing and the Text::CSV_XS module to generate the output files. – 2014-11-21 22:57:11

+1

This sounds like you need XML::Twig. Please show the file contents that this data should result in. – Borodin 2014-11-21 22:58:34

Answers

-1

Please check this one:

use strict; 
use warnings; 
use Cwd; 

#Get Directory 
my $dir = getcwd(); 

#Grep files from the directory 
opendir(DIR, $dir) || die "Couldn't open/read the $dir: $!"; 
my @AllFiles = grep(/\.txt$/i, readdir(DIR)); 
closedir(DIR); 

#Check files are available 
if(@AllFiles) 
{ 
    #Create Text Files as per Requirement 
    open(PAP, ">$dir/Papers.txt") || die "Couldn't create the file: $!"; 
    open(BOOK, ">$dir/Book.txt") || die "Couldn't create the file: $!"; 
    open(EDT, ">$dir/Editors.txt") || die "Couldn't create the file: $!"; 
    open(AUT, ">$dir/Authors.txt") || die "Couldn't create the file: $!"; 
} 
else { die "No .txt files found in $dir\n"; } #Die if no input files were found 
foreach my $input (@AllFiles) 
{ 
    print "Processing file $input\n"; 
    open(IN, "$dir/$input") || die "Couldn't open the file: $!"; 
    local $/; $_=<IN>; my $tmp=$_; 
    close(IN); 
    #Loop from <PaperTitle> to // end slash 
    while($tmp=~m/(<PaperTitle>((?:(?!\/\/).)*)\/\/)/gs) 
    { 
     my $LoopCnt = $1; 
     my ($pptle) = $LoopCnt=~m/<PaperTitle>([^<>]*)<\/PaperTitle>/g; 
     my ($abstr) = $LoopCnt=~m/<Abstract>([^<>]*)<\/Abstract>/gs; 
     my ($pgrng) = $LoopCnt=~m/<Page>([^<>]*)<\/Page>/g; 
     my ($bktle) = $LoopCnt=~m/<BookTitle>([^<>]*)<\/BookTitle>/g; 
     my ($edtns) = $LoopCnt=~m/<Edition>([^<>]*)<\/Edition>/g; 
     my ($publr) = $LoopCnt=~m/<Publisher>([^<>]*)<\/Publisher>/g; 
     my ($years) = $LoopCnt=~m/<Year>([^<>]*)<\/Year>/g; 

     my ($EditorNames, $AuthorNames) = ("", ""); 

     #Collect the editors: numbered last name, fore name, initials 
     if($LoopCnt=~m#<EditorList>((?:(?!</EditorList>).)*)</EditorList>#s) 
     { 
      my @Edlines = split /\n/, $1; 
      my $i = 1; #Editor count 
      foreach my $EdsngLine(@Edlines) 
      { 
       if($EdsngLine=~m/<LastName>([^<>]*)<\/LastName>/) 
       { $EditorNames .= $i.$1."\t"; $i++; } 
       elsif($EdsngLine=~m/<ForeName>([^<>]*)<\/ForeName>/) 
       { $EditorNames .= $1."\t"; } 
       elsif($EdsngLine=~m/<Initials>([^<>]*)<\/Initials>/) 
       { $EditorNames .= $1."\t"; } 
      } 
     } 

     #Collect the authors the same way 
     if($LoopCnt=~m#<AuthorList>((?:(?!</AuthorList>).)*)</AuthorList>#s) 
     { 
      my @Autlines = split /\n/, $1; 
      my $j = 1; #Author count 
      foreach my $AutsngLine(@Autlines) 
      { 
       if($AutsngLine=~m/<LastName>([^<>]*)<\/LastName>/) 
       { $AuthorNames .= $j.$1."\t"; $j++; } 
       elsif($AutsngLine=~m/<ForeName>([^<>]*)<\/ForeName>/) 
       { $AuthorNames .= $1."\t"; } 
       elsif($AutsngLine=~m/<Initials>([^<>]*)<\/Initials>/) 
       { $AuthorNames .= $1."\t"; } 
      } 
     } 

     #Print the output to the corresponding text files 
     print PAP "$pptle\t$abstr\t$pgrng\t//\n"; 
     print BOOK "$pptle\t$bktle\t$edtns\t$publr\t$years\t//\n"; 
     print EDT "$pptle\t$EditorNames//\n"; 
     print AUT "$pptle\t$AuthorNames//\n"; 
    } 
} 

print "Process Completed...\n"; 

#Don't forget to close the files 
close(PAP); 
close(BOOK); 
close(EDT); 
close(AUT); 
#End 
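
One thing to watch with this approach: the regexes copy the captured text as-is, so an XML entity such as the `&lt;` in the sample abstract would land in Papers.txt undecoded. A minimal sketch of a fix, assuming only the five predefined XML entities occur in the data; `decode_entities_basic` is a hypothetical helper name, not part of the script above:

```perl
use strict;
use warnings;

# Hypothetical helper: decode the five predefined XML entities.
my %ENTITY = (lt => '<', gt => '>', amp => '&', quot => '"', apos => "'");

sub decode_entities_basic {
    my ($text) = @_;
    # Single pass, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
    $text =~ s/&(lt|gt|amp|quot|apos);/$ENTITY{$1}/g;
    return $text;
}

print decode_entities_basic('p &lt; 0.001'), "\n";
```

It could be applied to each captured field (for example `$abstr`) just before printing. Numeric character references are not handled by this sketch.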
+1

There is no excuse for parsing XML with regular expressions. – Borodin 2014-11-22 17:20:42

+0

@Borodin: I would be interested in using an XML module. Could you complete the code, so that I can take it further in my program? Thanks in advance. – ssr1012 2014-11-23 07:10:48

+0

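Thanks @Borodin and ssr1012 for the help here. I should have specified one more thing. I will have to run this script on many files (e.g. BC_Book, EC_Book, CC_Book, etc.), 15 files in total. I want to concatenate the data, or append to the files each time the script runs, but here new files are created every time. I should be able to work through the code myself, but I am being lazy/bogged down in other aspects of this project. Extra help here would be much appreciated! – BigData 2014-11-25 19:44:22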
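
On appending rather than recreating: the `>` open mode truncates a file, while `>>` creates it if missing and otherwise appends, so repeated runs accumulate rows. A minimal sketch under that assumption; `open_outputs` is a hypothetical helper and the demo writes to a scratch directory rather than the real data directory:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: open the four output files in append mode.
sub open_outputs {
    my ($dir) = @_;
    my %out;
    for my $name (qw/ Papers Book Editors Authors /) {
        # '>>' creates the file if needed and appends otherwise
        open my $fh, '>>', "$dir/$name.txt"
            or die "Couldn't open $dir/$name.txt for appending: $!";
        $out{$name} = $fh;
    }
    return \%out;
}

# Demo: two "runs" against the same scratch directory.
my $scratch = tempdir(CLEANUP => 1);
for my $run (1 .. 2) {
    my $out = open_outputs($scratch);
    print { $out->{Book} } join("\t", "Title $run", 'Book1', '1st'), "\n";
    close $_ for values %$out;
}
```

After both runs the scratch Book.txt holds two rows, one per run. In the script above, the equivalent change would be replacing `">$dir/Papers.txt"` with `">>$dir/Papers.txt"` and likewise for the other three files.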

0

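This example may help you. It uses XML::Twig, as I suggested, to extract the fields for the Papers.txt output file. The record separator is set to "//\n" so that a whole block of data is read at once, and before the block is parsed it is wrapped in <Paper>...</Paper> tags to make it valid XML.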

use strict; 
use warnings; 
use 5.010; 
use autodie; 

use XML::Twig; 

my $twig = XML::Twig->new; 

open my $fh, '<', 'papers.txt'; 
local $/ = "//\n"; 

while (<$fh>) { 
    $twig->parse("<Paper>\n$_\n</Paper>\n"); 
    my $root = $twig->root; 
    say $root->field($_) for qw/ PaperTitle Abstract Page/; 
    say '---'; 
} 

Output

True incidence of all complications following immediate and delayed breast reconstruction. 
BACKGROUND: Improved self-image and psychological well-being after breast reconstruction are well documented. To determine methods that optimized results with minimal morbidity, the authors examined their results and complications based on reconstruction method and timing. METHODS: The authors reviewed all breast reconstructions after mastectomy for breast cancer performed under the supervision of a single surgeon over a 6-year period at a tertiary referral center. Reconstruction method and timing, patient characteristics, and complication rates were reviewed. RESULTS: Reconstruction was performed on 240 consecutive women (94 bilateral and 146 unilateral; 334 total reconstructions). Reconstruction timing was evenly split between immediate (n = 167) and delayed (n = 167). Autologous tissue (n = 192) was more common than tissue expander/implant reconstruction (n = 142), and the free deep inferior epigastric perforator was the most common free flap (n = 124). The authors found no difference in the complication incidence with autologous reconstruction, whether performed immediately or delayed. However, there was a significantly higher complication rate following immediate placement of a tissue expander when compared with delayed reconstruction (p = 0.008). Capsular contracture was a significantly more common late complication following immediate (40.4 percent) versus delayed (17.0 percent) reconstruction (p < 0.001; odds ratio, 5.2; 95 percent confidence interval, 2.3 to 11.6). CONCLUSIONS: Autologous reconstruction can be performed immediately or delayed, with optimal aesthetic outcome and low flap loss risk. However, the overall complication and capsular contracture incidence following immediate tissue expander/implant reconstruction was much higher than when performed delayed. Thus, tissue expander placement at the time of mastectomy may not necessarily save the patient an extra operation and may compromise the final aesthetic outcome. 
19-28 
--- 
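
The same idea can plausibly be extended to the other three files. The following is only a sketch, assuming one tab-separated row per editor or author with the paper title repeated in the first column (the question does not pin down that layout); `record_rows` is a hypothetical helper, while `field`, `first_child`, and `children` are XML::Twig element methods:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: turn one "//"-terminated record into tab-separated rows.
sub record_rows {
    my ($record) = @_;
    $record =~ s{//\s*\z}{};    # drop the end-of-entry marker

    my $twig = XML::Twig->new;
    $twig->parse("<Paper>\n$record\n</Paper>\n");
    my $root  = $twig->root;
    my $title = $root->field('PaperTitle');

    my %rows;
    $rows{Book} = [
        join "\t", map { $root->field($_) }
            qw/ PaperTitle BookTitle Edition Publisher Year /
    ];

    # Assumed layout: one row per person, title first.
    for my $kind (qw/ Editor Author /) {
        my $list   = $root->first_child("${kind}List");
        my @people = $list ? $list->children($kind) : ();
        my @lines;
        for my $person (@people) {
            push @lines, join "\t", $title,
                map { $person->field($_) } qw/ LastName ForeName Initials /;
        }
        $rows{"${kind}s"} = \@lines;
    }
    return \%rows;
}

my $record = <<'END';
<PaperTitle>Sample title</PaperTitle>
<BookTitle>Book1</BookTitle>
<Publisher>Publisher01, Boston</Publisher>
<Edition>1st</Edition>
<EditorList>
    <Editor>
     <LastName>Lewis</LastName>
     <ForeName>Philip M</ForeName>
     <Initials>PM</Initials>
    </Editor>
</EditorList>
<Year>2008</Year>
//
END

my $rows = record_rows($record);
print "$_\n" for @{ $rows->{Book} }, @{ $rows->{Editors} };
```

Inside the `while (<$fh>)` loop, each list in the returned hash could then be printed to its own file handle (Book.txt, Editors.txt, Authors.txt).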
+0

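Thanks @Borodin for the help here. This is still a long way from the point where I could use the code to implement my own complete program. Still, I understand what you did here, and I appreciate your help. – BigData 2014-11-25 19:29:42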