2014-11-05 106 views

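Parsing paragraphs with Perl

I have a large number of PDFs. They are publications generated every month, and I would like to automate extracting and parsing these documents to get contact information for import into a database.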

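Assume each block of text has a START and an END tag. After the START tag I need to grab the "Company", skip "(Parantheses)" and the PARAgraph, and then grab PARTNER_COMPANY, "Title" and the various forms of contact information, up to the next END tag.

The contact information strings can vary. Some blocks have more information than others, but I still need a uniform format under specific headers. In one variation the state, country and zip code sit on the same line, separated by a delimiter (a comma in the sample data). In other variations they are separated by \n. When the program reaches the "Dated" part of a block, the date needs to be parsed into a specific format (see below). Some blocks provide all of this contact information and others do not. I want to parse up to the END tag.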

Sample data

START 

Company_1_ANY type of character 

(Parantheses) 

PARAgraph 

DATE: Dated this 5 day of NOvermber 2014 - parse date to yyyy-mm-dd format(2014-11-05) 


PARTNER_COMPANY_1 

Title - title_1 

Contact for enquiries: - CONTACT PERSON 

HOMER Simpson 

Telephone: (123) 123-1234 

FAX: (111) 346-0000 

Address: 

P.O. Box 123454, ANYTown, 12345-1234 

STATE, USA 

END 



START 

COMPANY_2_ANY type of character 

(Parantheses) 


PARAGRAPH of random text 

Dated this 5 day of November 2014 - 2014-11-05 

PARTNER_COMPANY_2 

Title - Title_2 






address: 

190 RAndom Avenue, Any town 

STATE_2 12345-0987 

Country - USA 

Contact: 

JOsh E 

Telephone: (234) 111-1111 

END 

CODE

my @name; 

while (<>) { 
    if (/START/gism) { 
    while (<>) { 
     last if /END/; 
     chomp; 
     push @name, $_; 

    } 
    print "\n@name\n"; 
    @name =() 
    } 
    else { 
    print ''; 
    } 
} 

My results

Company_1_ANY type of character (Parantheses) PARAgraph DATE: Dated this 5 day of NOvermber 2014 - parse date to yyyy-mm-dd format(2014-11-05) PARTNER_COMPANY_1 Title - title_1 Contact for enquiries: - CONTACT PERSON HOMER Simpson Telephone: (123) 123-1234 FAX: (111) 346-0000 Address: P.O. Box 123454, ANYTown, 12345-1234 STATE, USA 
COMPANY_2_ANY type of character (Parantheses)  PARAGRAPH of random text Dated this 5 day of November 2014 - 2014-11-05 PARTNER_COMPANY_2 Title - Title_2   address: 190 RAndom Avenue, Any town STATE_2 12345-0987 Country - USA Contact: JOsh E Telephone: (234) 111-1111 

Desired output

Company,DATE,PARTNER_COMPANY,Title,CONTACT PERSON,Telephone,FAX,Address,City,STATE,ZIP,Country 

Company_1,2014-11-05,PARTNER_COMPANY_1,title_1,HOMER Simpson,(123) 123-1234,(111) 346-0000,P.O. Box 123454,ANYTown,12345-1234,USA 

COMPANY_2,2014-11-05,PARTNER_COMPANY_2,Title_2,JOsh E,(234) 111-1111,,190 RAndom Avenue,Any town,STATE_2,12345-0987,USA 

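I get what I want between START and END, but I don't know how to delimit the elements in my array. I also can't figure out how to filter out the parts I don't need, i.e. the PARAGRAPH. I would also like to modify the content between the delimiters. I know a module might be useful for this, but for the sake of better understanding how to create hashes and/or keys, is there a better way?

Also, please disregard the line breaks shown in the DESIRED OUTPUT lines. Each line should continue as one comma-separated record. This thread only lets text run to a certain width before it wraps.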
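
For the "Dated" requirement described above, one possible approach is a regular expression plus a month lookup table. This is only a minimal sketch: `parse_dated` and `%MONTH` are hypothetical names, and matching on the first three letters of the month is an assumption made so that a misspelling such as "NOvermber" in the sample data still resolves to November.

```perl
use strict;
use warnings;

# Map lower-cased three-letter month prefixes to month numbers.
my %MONTH;
@MONTH{qw(jan feb mar apr may jun jul aug sep oct nov dec)} = (1 .. 12);

# parse_dated() is a hypothetical helper, not part of any module.
# It returns 'yyyy-mm-dd', or undef when the line holds no recognisable date.
sub parse_dated {
    my ($line) = @_;
    return unless $line =~ /Dated\s+this\s+(\d{1,2})\S*\s+day\s+of\s+([A-Za-z]+)\s+(\d{4})/i;
    my ($day, $name, $year) = ($1, $2, $3);

    # Only the first three letters decide the month (assumption, see above).
    my $month = $MONTH{ lc substr($name, 0, 3) } or return;

    return sprintf '%04d-%02d-%02d', $year, $month, $day;
}

print parse_dated('DATE: Dated this 5 day of NOvermber 2014'), "\n";
```

Inside a line-by-line loop, the helper could be called on each line and its result stored only when it is defined.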

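Please format your sample input properly, the way you did for your code! – 2014-11-05 22:14:39

@sputnick Does that work? – JDE876 2014-11-05 23:04:58

Yes. Are the blank lines really blank lines? Not an input formatting mistake? – 2014-11-05 23:07:44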

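Answer

Here is a script to use as a basis; it needs more work to fit your needs completely. It stores the information in a Perl data structure (DS): an ARRAY of HASHes, one HASH per START/END block. Once processing is done, you just need to iterate over the DS to produce the output you want: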

#!/usr/bin/env perl

use strict; use warnings; # always put this in your scripts
use Data::Dumper; # to print the data structure (DS)

my $h = []; # $h is a reference to an empty ARRAY (one HASH per block will go in it)
my $witness1 = my $witness2 = 0; # two flags, both initialised to 0
my $key = -1; # index of the current block; becomes 0 at the first START

# looping over the sample input in the __DATA__ section below
# (replace <DATA> with the 'diamond operator' <> to read a file given on the command line)
while (<DATA>) {
    next if /^\s*$/; # skip this line if it's blank

    $key++ if /^START/; # increment $key if the current line begins with 'START'

    # setting HASH values; $& is the matched text, and \K drops the label before it
    $h->[$key]->{Company} = $& if /^Company_.*/i;
    $h->[$key]->{Partner_Company} = $& if /^PARTNER_COMPANY.*/i;
    $h->[$key]->{Title} = $& if /^TITLE\s+-\s+\K.*/i;

    # if there's a 'CONTACT PERSON' string in the current line
    if (/CONTACT\s+PERSON/) {
        $witness1 = 1;
        next;
    }

    # witness1 tells us that we are still in the 'CONTACT PERSON' part
    if ($witness1) {
        chomp; # chomp returns the number of characters removed, not the string
        $h->[$key]->{Name} = $_;
        $witness1 = 0;
    }

    $h->[$key]->{Tel} = $& if /^Telephone: \K.*/i;
    $h->[$key]->{Fax} = $& if /^FAX: \K.*/i;

    if (/^Address:/i) {
        $witness2 = 1;
        next;
    }

    # witness2 tells us that we are still in the 'ADDRESS' part
    if ($witness2 and !/^END/) {
        $h->[$key]->{Address} .= $_;
    }

    if (/^END/) {
        $witness2 = 0;
    }
}

print Dumper $h;

__DATA__ 
START 

Company_1_ANY type of character 

(Parantheses) 

PARAgraph 

DATE: Dated this 5 day of NOvermber 2014 - parse date to yyyy-mm-dd format(2014-11-05) 


PARTNER_COMPANY_1 

Title - title_1 

Contact for enquiries: - CONTACT PERSON 

HOMER Simpson 

Telephone: (123) 123-1234 

FAX: (111) 346-0000 

Address: 

P.O. Box 123454, ANYTown, 12345-1234 

STATE, USA 

END 



START 

COMPANY_2_ANY type of character 

(Parantheses) 


PARAGRAPH of random text 

Dated this 5 day of November 2014 - 2014-11-05 

PARTNER_COMPANY_2 

Title - Title_2 






address: 

190 RAndom Avenue, Any town 

STATE_2 12345-0987 

Country - USA 

Contact: 

JOsh E 

Telephone: (234) 111-1111 

END 
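
The last step mentioned above, iterating over the DS to produce the output, could look roughly like the sketch below. It is an illustration under assumptions, not a finished exporter: `csv_line` and `@columns` are hypothetical names, the sample records are written by hand in the same shape as `$h`, and the address is kept as a single field because splitting it into city, state and zip depends on how the real data varies.

```perl
use strict;
use warnings;

# Column order for the output; the keys match those filled in by the script above.
my @columns = qw(Company Partner_Company Title Name Tel Fax Address);

# csv_line() is a hypothetical helper: it turns one record (a HASH reference)
# into a comma-separated line, leaving missing fields empty.
sub csv_line {
    my ($record) = @_;
    my @fields;
    for my $col (@columns) {
        my $value = defined $record->{$col} ? $record->{$col} : '';
        $value =~ s/\s+/ /g;       # collapse newlines and runs of whitespace
        $value =~ s/^ | $//g;      # trim one leading/trailing space
        if ($value =~ /[",]/) {    # quote fields that contain a comma or a quote
            $value =~ s/"/""/g;
            $value = qq("$value");
        }
        push @fields, $value;
    }
    return join ',', @fields;
}

# Hand-written sample in the same shape as $h: an ARRAY reference of HASH references.
my $h = [
    { Company => 'Company_1', Partner_Company => 'PARTNER_COMPANY_1',
      Title => 'title_1', Name => 'HOMER Simpson', Tel => '(123) 123-1234',
      Fax => '(111) 346-0000',
      Address => "P.O. Box 123454, ANYTown, 12345-1234\nSTATE, USA\n" },
    { Company => 'COMPANY_2', Title => 'Title_2', Tel => '(234) 111-1111' },
];

print join(',', @columns), "\n";
print csv_line($_), "\n" for @$h;
```

Fields that a block does not provide simply come out empty, so every line keeps the same number of columns.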

Documentation:

To understand references, I suggest a few pointers:
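
As a quick illustration of how references behave in a structure like `$h`, here is a minimal sketch. The variable names and values are made up for the example; only the syntax (`\`, `[]`, `{}`, `->`, `@$ref`, `%$ref`) is standard Perl.

```perl
use strict;
use warnings;

my %record = (Company => 'Company_1', Title => 'title_1');  # a plain hash
my $ref    = \%record;   # a reference to that hash
my $list   = [];         # a reference to an empty anonymous array

push @$list, $ref;                         # store the hash reference in the array
push @$list, { Company => 'COMPANY_2' };   # { ... } builds an anonymous hash reference

# The arrow dereferences. Between two subscripts it is optional,
# so $list->[0]->{Title} and $list->[0]{Title} mean the same thing.
print $list->[0]->{Title}, "\n";

# Autovivification: assigning through an index that does not exist yet
# creates the hash automatically, which is what the script relies on.
$list->[2]{Company} = 'COMPANY_3';

# @$list and %$rec dereference the whole array and the whole hash.
for my $rec (@$list) {
    my @keys = sort keys %$rec;
    print "$rec->{Company}: @keys\n";
}
```

Because `$list->[0]` holds a reference rather than a copy, changing `$record{Title}` afterwards would also change what `$list->[0]{Title}` returns.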

Could you add comments to your code so I can understand it better? I'm having some trouble working through this. – JDE876 2014-11-13 22:02:02

POST edited accordingly – 2014-11-13 22:40:54