2013-03-18

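Extracting from a CSV file

I have alphanumeric text for a location grid (A-I and 1-9) that is referenced in a flat file (*.csv) in various forms, sometimes with spaces and in random case, such as: 9-H, @ b 3, e-4, d4, c6, 5h, C2, i9, ... It can be any combination of a to i and 1 to 9, possibly mixed with whitespace, ~ and punctuation.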

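What is a good way to extract these alphanumeric characters? Ideally, the output would go in another column in front of "Notes", or in a separate text file. I can read scripts and work out what they do, but I am not yet confident enough to write them myself.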

Sample input file:

Record Notes 
46651 Adrian reported green-pylons are in central rack. (e-4) 
46652 Jose enetered location of triangles in the uppur corner. (b/c6) 
46207 [Location: 5h] Gabe located the long pipes in the near the far corner. 
46205 Committee-reports are in boxes in holding area, @ b 3). 
45164 Caller-nu,mbers @ 1A 
45165 All carbon rod tackles 3 F and short (top rack) 
45166 USB(3 Port) in C2 
45167 Full tackle in b2. 
45168 5b; USB(4 port) 
45073 SHOVELs+ KIPER ON PET-FOOD (@g6), ALSO ATTEMPT-STALL AND DRAWCORD. 
45169 Persistent CORDS ~i9 
45170 Deliverate handball moved to D-2 on instructions from Pete 
45440 slides and overheads + contact-sheets to 9-H (top bin). 
45441 d7-slides and negatives (black and white) 
<eof> 

Desired output (in alphanumeric form, either in the same file or in a new file):

Record Location Notes 
46651 E4 
46652 C6 
46205 B3 
... 
45169 I9 

That is, always extract the latter set of characters (for example, C6 from b/c6).

OK guys, after getting "Use of uninitialized value $note in pattern match (m//)" errors, I have just made an attempt and had partial success.

# # starts with anything then space or punctuation then letter then number 
if ($note =~ /.*[\s\~\p{Punct}]([a-iA-I])[\s\p{Punct}]*([0-9]).*/) { 
    $note =~ s/.*[\s\~\p{Punct}]([a-iA-I])[\s\p{Punct}]*([0-9]).*/$1$2/x; 

# # starts line with letter then number 
} elsif ($note =~ /^([a-iA-I])[\s\p{Punct}]*([0-9]).*/) { 
    $note =~ s/^([a-iA-I])[\s\p{Punct}]*([0-9]).*/$1$2/x; 

# # after punctuation then number 
} elsif ($note =~ /.*[\s\p{Punct}]([0-9])[\s\p{Punct}]*([a-iA-I]).*/) { 
    $note =~ s/.*[\s\p{Punct}]([0-9])[\s\p{Punct}]*([a-iA-I]).*/$2$1/x; 

# # beginning of line with number 
} elsif ($note =~ /^([0-9])[\s\p{Punct}]*([a-iA-I]).*/) { 
    $note =~ s/^([0-9])[\s\p{Punct}]*([a-iA-I]).*/$2$1/x; 

# # empty line or no record of any grid location except "#7 asdfg" format 
} elsif ($note=~ "") { 
    $note = "##"; 

} 

The script is not very successful when it meets records such as 99994 and 99993.

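99999 norecordofgridhere -
99997 Box #7 went into the array without an invoice.
99996 Dropped off in the 7th hour, and when I found out off-site, the coach was in the 8th hour.
99994 Boxes after the post, 4 buckets on top of the office filing cabinet
99993 6 boxes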

The output is now:

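99999 ## norecordofgridhere -
99998 ##
99997 E7 Box #7 went into the array without an invoice.
99996 E8 Dropped off in the 7th hour, and when I found out off-site, the coach was in the 8th hour.
99994 B 4 Boxes after the post, 4 buckets
99993 b 6 Allocated 6 boxes on top of the office filing cabinet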

Records 99994 and 99993 should each have come out as ##. Where did I go wrong, and how should I fix it?

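I think there is a cleaner way, such as using Text::CSV_XS, but I ran into glitches with Strawberry Perl even after the tests showed the module had installed correctly. So I went back to ActiveState.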

Can you give the desired output for this example input? – azhrei 2013-03-19 00:04:26

Just to be clear: the things you want to grab are 'e-4', 'b/c6', '5h', 'b 3', '1A', '3 F', 'C2', 'b2', '5b', 'g6', 'i9', 'D-2', '9-H' and 'd7'? – Dougal 2013-03-19 00:04:41

Not only grab them, but also list them in alphanumeric form for each record in the file, i.e. E4, C6, B3, A1 and so on. – Solutions 2013-03-19 00:23:09

Answers

0
... 

my $coord; 
if ($note =~/
    (?&DEL) 

    ((?&ROW) (?&SEP)?+ (?&COL) 
    | (?&COL) (?&SEP)?+ (?&ROW) 
    ) 

    (?&DEL) 

    (?(DEFINE) 
     (?<ROW> [a-iA-I] ) 
     (?<COL> [1-9]  ) 
     (?<SEP> [\s~\@\-]++) 
     (?<DEL>^| \W | \z) 
    ) 
/x) { 
    $coord = $1; 
    (my $row = uc($coord)) =~ s/[^A-I]//g; 
    (my $col = uc($coord)) =~ s/[^1-9]//g; 
    $coord = "$row$col"; 
} 

... 
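
As a rough way to try the pattern above against the sample notes, here is a minimal sketch that wraps it in a helper. The name extract_coord is hypothetical, and the row class is assumed to run from a to i so that references such as i9 are recognised.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern above. Returns a normalized
# reference such as "E4", or undef when the note holds no grid reference.
sub extract_coord {
    my ($note) = @_;
    return unless defined $note;
    return unless $note =~ /
        (?&DEL)
        ( (?&ROW) (?&SEP)?+ (?&COL)
        | (?&COL) (?&SEP)?+ (?&ROW)
        )
        (?&DEL)
        (?(DEFINE)
            (?<ROW> [a-iA-I] )
            (?<COL> [1-9] )
            (?<SEP> [\s~\@\-]++ )
            (?<DEL> ^ | \W | \z )
        )
    /x;
    my $coord = uc $1;
    (my $row = $coord) =~ s/[^A-I]//g;   # keep only the letter
    (my $col = $coord) =~ s/[^1-9]//g;   # keep only the digit
    return "$row$col";
}

for my $note ('Persistent CORDS ~i9', 'slides to 9-H (top bin).', '6 boxes') {
    my $coord = extract_coord($note);
    printf "%-28s => %s\n", $note, defined $coord ? $coord : '##';
}
```

Because the delimiter after the coordinate must be a non-word character or the end of the string, a note such as "6 boxes" is not mistaken for B6.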
0

Use Text::CSV_XS to parse the CSV file; it is fast and accurate.

Then build a regular expression to match the IDs.

Finally, normalize each ID.

#!/usr/bin/perl 

use v5.10; 
use strict; 
use warnings; 
use autodie; 

use Text::CSV_XS; 

# Build up the regular expression to look for IDs 
my $Separator_Set = qr{ [- ] }x; 
my $ID_Letters_Set = qr{ [a-i] }xi; 
my $ID_Numbers_Set = qr{ [1-9] }x; 
# Group the alternation so that both \b anchors apply to either ordering.
my $Location_Re = qr{ 
    \b 
    (?: 
        $ID_Letters_Set $Separator_Set? $ID_Numbers_Set | 
        $ID_Numbers_Set $Separator_Set? $ID_Letters_Set 
    ) 
    \b 
}x; 

# Initialize Text::CSV_XS and tell it this is a tab separated CSV 
my $csv = Text::CSV_XS->new({ 
    sep_char => "\t", # tab separated fields 
}) or die "Cannot use CSV: ".Text::CSV_XS->error_diag(); 

# Read in and discard the CSV header line. 
my $headers = $csv->getline(*DATA); 

# Output our own header line  
say "Record\tLocation\tNotes"; 

# Read each CSV row, extract and normalize the ID, and output a new row. 
while(my $row = $csv->getline(*DATA)) { 
    my($record, $notes) = @$row; 

    # Extract and normalize the ID 
    my($id) = $notes =~ /($Location_Re)/; 
    $id = normalize_id($id); 

    # Output a new row 
    printf "%d\t%s\t%s\n", $record, $id, $notes; 
} 


sub normalize_id { 
    my $id = shift; 

    # Return empty string if we were passed in a blank 
    return '' if !defined $id or !length $id or $id !~ /\S/; 

    my($letter) = $id =~ /($ID_Letters_Set)/; 
    my($number) = $id =~ /($ID_Numbers_Set)/; 

    return uc($letter).$number; 
} 

__END__ 
Record Notes 
46651 Adrian reported green-pylons are in central rack. (e-4) 
46652 Jose enetered location of triangles in the uppur corner. (b/c6) 
46207 [Location: 5h] Gabe located the long pipes in the near the far corner. 
46205 Committee-reports are in boxes in holding area, @ b 3). 
45164 Caller-nu,mbers @ 1A 
45165 All carbon rod tackles 3 F and short (top rack) 
45166 USB(3 Port) in C2 
45167 Full tackle in b2. 
45168 5b; USB(4 port) 
45073 SHOVELs+ KIPER ON PET-FOOD (@g6), ALSO ATTEMPT-STALL AND DRAWCORD. 
45169 Persistent CORDS ~i9 
45170 Deliverate handball moved to D-2 on instructions from Pete 
45440 slides and overheads + contact-sheets to 9-H (top bin). 
45441 d7-slides and negatives (black and white)
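
The question also asks that the last grid reference in a note wins, and that notes such as "6 boxes" yield no location. A minimal sketch of one way to get both, assuming a hypothetical helper named last_location and the same separator set as the script above (hyphen or space): word boundaries on both sides reject "6 b" inside "6 boxes", and a list-context match with /g collects every candidate so the last one can be taken.

```perl
use strict;
use warnings;

# Letter-then-digit or digit-then-letter, optionally separated by "-" or " ",
# with a word boundary required on both sides of either ordering.
my $Location_Re = qr{
    \b
    (?: [a-i] [- ]? [1-9]
      | [1-9] [- ]? [a-i]
    )
    \b
}xi;

# Hypothetical helper: returns the last grid reference in a note, normalized
# to letter-then-digit (e.g. "H9"), or '' when there is none.
sub last_location {
    my ($notes) = @_;
    return '' unless defined $notes;     # avoids "uninitialized value" warnings
    my @found = $notes =~ /($Location_Re)/g;
    return '' unless @found;
    my ($letter) = $found[-1] =~ /([a-i])/i;
    my ($number) = $found[-1] =~ /([1-9])/;
    return uc($letter) . $number;
}

for my $notes ('All carbon rod tackles 3 F and short (top rack)', '6 boxes') {
    printf "%s\t%s\n", last_location($notes) || '##', $notes;
}
```

Unlike the delimiter-based pattern in the other answer, this sketch does not accept ~ or @ as a separator between the letter and the digit; widen the character class if the data needs it.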