Extracting grid locations from a CSV file


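I have alphanumeric text for grid locations (A-I and 1-9) that is referenced in a flat file (*.csv) in various forms, sometimes with spaces and in random case, such as: 9-H, @ b 3, e-4, d4, c6, 5h, C2, i9, ... That is, any combination of a to i and 1 to 9, including blanks, ~, and @.

What is a good way to extract these alphanumeric locations? Ideally, the output would go in another column in front of "Notes", or in a separate text file. I can read scripts and figure out what they do, but I am not yet confident enough to write them.

Sample input file: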

Record Notes 
46651 Adrian reported green-pylons are in central rack. (e-4) 
46652 Jose enetered location of triangles in the uppur corner. (b/c6) 
46207 [Location: 5h] Gabe located the long pipes in the near the far corner. 
46205 Committee-reports are in boxes in holding area, @ b 3). 
45164 Caller-nu,mbers @ 1A 
45165 All carbon rod tackles 3 F and short (top rack) 
45166 USB(3 Port) in C2 
45167 Full tackle in b2. 
45168 5b; USB(4 port) 
45073 SHOVELs+ KIPER ON PET-FOOD (@g6), ALSO ATTEMPT-STALL AND DRAWCORD. 
45169 Persistent CORDS ~i9 
45170 Deliverate handball moved to D-2 on instructions from Pete 
45440 slides and overheads + contact-sheets to 9-H (top bin). 
45441 d7-slides and negatives (black and white) 
<eof>

Desired output (in letter-number format, either in the same file or in a new file):

Record Location Notes 
46651 E4 
46652 C6 
46205 A1 
... 
46169 I9

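That is, always extract the last such character set.

OK guys, after getting a "Use of uninitialized value $note in pattern match (m//)" error, I made an attempt of my own and had partial success.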

# # starts with anything then space or punctuation then letter then number 
if ($note =~ /.*[\s\~\p{Punct}]([a-iA-I])[\s\p{Punct}]*([0-9]).*/) { 
    $note =~ s/.*[\s\~\p{Punct}]([a-iA-I])[\s\p{Punct}]*([0-9]).*/$1$2/x; 

# # starts line with letter then number 
} elsif ($note =~ /^([a-iA-I])[\s\p{Punct}]*([0-9]).*/) { 
    $note =~ s/^([a-iA-I])[\s\p{Punct}]*([0-9]).*/$1$2/x; 

# # after punctuation then number 
} elsif ($note =~ /.*[\s\p{Punct}]([0-9])[\s\p{Punct}]*([a-iA-I]).*/) { 
    $note =~ s/.*[\s\p{Punct}]([0-9])[\s\p{Punct}]*([a-iA-I]).*/$2$1/x; 

# # beginning of line with number 
} elsif ($note =~ /^([0-9])[\s\p{Punct}]*([a-iA-I]).*/) { 
    $note =~ s/^([0-9])[\s\p{Punct}]*([a-iA-I]).*/$2$1/x; 

# # empty line or no record of any grid location except "#7 asdfg" format 
} elsif ($note=~ "") { 
    $note = "##"; 

}

The script is not very successful when it encounters records such as 99994 and 99993.

99999 norecordofgridhere -
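99997 Box #7 entered into the array without an invoice.
99996 Dropped in the 7th hour, and the coach was in the 8th hour when I was found off-site.
99994 Boxes after 4 buckets on top of the office filing cabinet
99993 6 boxes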

The output is now:

99999 ## norecordofgridhere -
99998 ##
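99997 E7 Box #7 entered into the array without an invoice.
99996 E8 Dropped in the 7th hour, and in the 8th hour when I was found off-site,
99994 B4 Boxes after 4 buckets
99993 b6 Allocated 6 boxes on top of the office filing cabinet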

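Records 99994 and 99993 should each show ##. Where did I go wrong, and how should I fix it?

I think there is a cleaner way, such as using Text::CSV_XS, but I ran into glitches with Strawberry Perl even after testing that the module was installed correctly, so I went back to ActiveState.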


2013-03-18 Solutions

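Can you give the desired output for this example input? – azhrei 2013-03-19 00:04:26

Just to be clear: the things you want to grab are 'e-4', 'b/c6', '5h', 'b 3', '1A', '3 F', 'C2', 'b2', '5b', 'g6', 'i9', 'D-2', '9-H' and 'd7'? – Dougal 2013-03-19 00:04:41

Not only grab them, but also list them in letter-number form for each record in the file, i.e. E4, C6, B3, A1, and so on. – Solutions 2013-03-19 00:23:09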

... 

my $coord; 
if ($note =~/
    (?&DEL) 

    ((?&ROW) (?&SEP)?+ (?&COL) 
    | (?&COL) (?&SEP)?+ (?&ROW) 
    ) 

    (?&DEL) 

    (?(DEFINE) 
     (?<ROW> [a-iA-I] ) 
     (?<COL> [1-9]  ) 
     (?<SEP> [\s~\@\-]++) 
     (?<DEL>^| \W | \z) 
    ) 
/x) { 
    $coord = $1; 
    (my $row = uc($coord)) =~ s/[^A-I]//g; 
    (my $col = uc($coord)) =~ s/[^1-9]//g; 
    $coord = "$row$col"; 
} 

...


2013-03-19 00:19:51 ikegami
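
For anyone who wants to try the pattern above on the sample notes, here is a minimal sketch that wraps it in a subroutine. The name `extract_coord` is hypothetical and not part of the answer. The row range is assumed to be a to i, as stated in the question, and the `defined` guard is there only to avoid the "uninitialized value" warning the asker mentioned.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: returns a normalized coordinate such as "E4",
# or undef when the note contains no grid reference.
sub extract_coord {
    my ($note) = @_;
    return unless defined $note;    # avoids "uninitialized value" warnings

    if ($note =~ /
        (?&DEL)
        ( (?&ROW) (?&SEP)?+ (?&COL)
        | (?&COL) (?&SEP)?+ (?&ROW)
        )
        (?&DEL)

        (?(DEFINE)
            (?<ROW> [a-iA-I] )
            (?<COL> [1-9] )
            (?<SEP> [\s~\@\-]++ )
            (?<DEL> ^ | \W | \z )
        )
    /x) {
        my $coord = uc $1;                        # e.g. "E-4" or "5H"
        (my $row = $coord) =~ s/[^A-I]//g;        # keep only the letter
        (my $col = $coord) =~ s/[^1-9]//g;        # keep only the digit
        return "$row$col";
    }
    return;
}

for my $note (
    'Adrian reported green-pylons are in central rack. (e-4)',
    'Caller-nu,mbers @ 1A',
    'norecordofgridhere -',
) {
    my $coord = extract_coord($note);
    say defined $coord ? $coord : '##';    # prints E4, A1, ## on separate lines
}
```

Note that this returns the first match in a note, not the last one.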

Use Text::CSV_XS to parse the CSV file; it is fast and accurate.

Then build a regular expression to match the IDs.

Finally, normalize each ID.

#!/usr/bin/perl 

use v5.10; 
use strict; 
use warnings; 
use autodie; 

use Text::CSV_XS; 

# Build up the regular expression to look for IDs 
my $Separator_Set = qr{ [- ] }x; 
my $ID_Letters_Set = qr{ [a-i] }xi; 
my $ID_Numbers_Set = qr{ [1-9] }x; 
my $Location_Re = qr{ 
    \b 
    (?: 
        $ID_Letters_Set $Separator_Set? $ID_Numbers_Set | 
        $ID_Numbers_Set $Separator_Set? $ID_Letters_Set 
    ) 
    \b 
}x; 

# Initialize Text::CSV_XS and tell it this is a tab separated CSV 
my $csv = Text::CSV_XS->new({ 
    sep_char => "\t", # tab separated fields 
}) or die "Cannot use CSV: ".Text::CSV_XS->error_diag(); 

# Read in and discard the CSV header line. 
my $headers = $csv->getline(*DATA); 

# Output our own header line  
say "Record\tLocation\tNotes"; 

# Read each CSV row, extract and normalize the ID, and output a new row. 
while(my $row = $csv->getline(*DATA)) { 
    my($record, $notes) = @$row; 

    # Extract and normalize the ID 
    my($id) = $notes =~ /($Location_Re)/; 
    $id = normalize_id($id); 

    # Output a new row 
    printf "%d\t%s\t%s\n", $record, $id, $notes; 
} 


sub normalize_id { 
    my $id = shift; 

    # Return empty string if we were passed in a blank 
    return '' if !defined $id or !length $id or $id !~ /\S/; 

    my($letter) = $id =~ /($ID_Letters_Set)/; 
    my($number) = $id =~ /($ID_Numbers_Set)/; 

    return uc($letter).$number; 
} 

__END__ 
Record Notes 
46651 Adrian reported green-pylons are in central rack. (e-4) 
46652 Jose enetered location of triangles in the uppur corner. (b/c6) 
46207 [Location: 5h] Gabe located the long pipes in the near the far corner. 
46205 Committee-reports are in boxes in holding area, @ b 3). 
45164 Caller-nu,mbers @ 1A 
45165 All carbon rod tackles 3 F and short (top rack) 
45166 USB(3 Port) in C2 
45167 Full tackle in b2. 
45168 5b; USB(4 port) 
45073 SHOVELs+ KIPER ON PET-FOOD (@g6), ALSO ATTEMPT-STALL AND DRAWCORD. 
45169 Persistent CORDS ~i9 
45170 Deliverate handball moved to D-2 on instructions from Pete 
45440 slides and overheads + contact-sheets to 9-H (top bin). 
45441 d7-slides and negatives (black and white)


2013-03-19 01:06:24 Schwern
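
The question asks for the last location in each note and mentions trouble installing Text::CSV_XS. The following is a rough sketch of one way to cover both points, not a drop-in replacement for the script above. The helper names `last_location` and `split_row` are hypothetical, the pattern is an assumed simplification (a letter a-i and a digit 1-9 in either order, optionally separated by a space or hyphen), and the plain `split` assumes tab-separated lines with no quoted fields.

```perl
use strict;
use warnings;

# Assumed pattern: letter and digit in either order, with word
# boundaries on both sides so that "6 boxes" is not read as "B6".
my $LOCATION_RE = qr{ \b (?: [a-i] [- ]? [1-9] | [1-9] [- ]? [a-i] ) \b }xi;

# Hypothetical helper: returns the last location in a note as e.g. "D2",
# or "##" (the asker's placeholder) when no location is found.
sub last_location {
    my ($notes) = @_;
    return '##' unless defined $notes;

    # In list context, a /g match returns every captured match.
    my @found = $notes =~ /($LOCATION_RE)/g;
    return '##' unless @found;

    my ($letter) = $found[-1] =~ /([a-i])/i;
    my ($number) = $found[-1] =~ /([1-9])/;
    return uc($letter) . $number;
}

# Hypothetical module-free row parser: splits a line at the first tab.
# It does not handle quoted fields or embedded tabs.
sub split_row {
    my ($line) = @_;
    chomp $line;
    my ($record, $notes) = split /\t/, $line, 2;
    return ($record, $notes);
}

my ($record, $notes) = split_row("45170\tmoved from a1 to D-2 on instructions\n");
printf "%s\t%s\t%s\n", $record, last_location($notes), $notes;
```

Because of the word boundaries, a note such as "6 boxes" yields "##" and not "B6", which is the kind of false positive described in the question.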
