2011-09-15 169 views
2

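Extracting text between identical marker lines

I need to extract the lines of text that sit between marker lines and put them into an Excel file. The number of lines in each block varies, but every block starts with the same text: Comment for the record "idno" ... and so on.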

__DATA__ (This is what my .txt file looks like) 
Comment for the record "id1" 
Attempt1 made on [time] outcome [outcome] 
note 1 

Comment for the record "id2" 
Attempt1 made on [time] outcome [outcome] 
note 1 
Attempt2 made on [time] outcome [outcome] 
note 2 

Comment for the record "id3" 
Attempt1 made on [time] outcome [outcome] 
note 1 
Attempt2 made on [time] outcome [outcome] 
note 2 
Attempt3 made on [time] outcome [outcome] 
note 3 
Attempt4 made on [time] outcome [outcome] 
note 4 

I would like the output to look like this:

id1  Attempt1 Note1 [outcome] 
id2  Attempt1 Note1 [outcome] 
id2  Attempt2 Note2 [outcome] 
id3  Attempt1 Note1 [outcome] 
id3  Attempt2 Note2 [outcome] 
id3  Attempt3 Note3 [outcome] 
id3  Attempt4 Note4 [outcome] 

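The outcome value will vary and will be a 2-3 digit numeric code.

Any help would be much appreciated. I have been browsing this site for the last day or two, but because of my limited experience I could not find anything relevant, and since I am fairly new to Perl and shell I thought it would be better to post this as a question.

Kind regards, Ace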

Answers

1

I think you are looking for something like this. It prints CSV, which can be opened in Excel:

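use strict;
use warnings;

# Slurp mode: read everything after __DATA__ in one go.
local $/;

# Split the input into blank-line-separated blocks and pass each block,
# together with the id captured from it, to block().
block(/(id\d+)/,$_) for split /\n\n/, <DATA>;

sub block {
    my ($id,$block) = @_;

    # Drop everything before the first "Attempt" (that is, the Comment line).
    $block =~ s/.*?(?=Attempt)//s;

    # One chunk per attempt. Print id, attempt, note (the last line of the
    # chunk), and the numeric outcome code as one CSV row.
    print join(',', $id, /(Attempt\d+)/, /([^\n]+)$/, /outcome (\d+)/)."\n"
    for split /(?=Attempt)/, $block
    ;
}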
+0

CPAN also has a simple Excel module that might be useful for this. – Sorpigal
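One caveat with building CSV rows by joining on commas, as the answer above does: a note that itself contains a comma or a double quote would shift the columns when the file is opened in Excel. A minimal sketch of the usual quoting rule follows. The helper names `csv_field` and `csv_line` and the sample values are made up for illustration and are not part of the original answer; for real work a dedicated CSV module from CPAN is probably the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: quote one field the way most CSV readers expect.
# A field containing a comma, a double quote, or a line break is wrapped
# in double quotes, and embedded double quotes are doubled.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ($value =~ /[",\r\n]/) {
        $value =~ s/"/""/g;
        return qq("$value");
    }
    return $value;
}

# Hypothetical helper: build one CSV line from a list of fields.
sub csv_line {
    return join(',', map { csv_field($_) } @_) . "\n";
}

# Made-up sample row: the note contains a comma, so it gets quoted.
print csv_line('id2', 'Attempt2', 'called back, no answer', '204');
```

In the answer's `block` sub, the `join(',', ...)."\n"` expression could be swapped for a call like `csv_line(...)` with the same arguments.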

2

Using GNU awk (for the regex capturing groups):

gawk ' 
    /^$/ {next} 
    match($0, /Comment for the record "([^"]*)/, a) {id = a[1]; next} 
    match($0, /(.+) made on .* outcome (.+)/, a) {att = a[1]; out = a[2]; next} 
    {printf("%s\t%s\t%s\t%s\n", id, att, $0, out)} 
' 

Or, translated to Perl:

perl -lne ' 
    chomp; 
    next if /^$/; 
    if (/Comment for the record "([^"]*)/) {$id = $1; next;} 
    if (/(.+) made on .* outcome (.+)/) {$att = $1; $out = $2; next;} 
    print join("\t", $id, $att, $_, $out); 
' 
1

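Unless I'm missing something, it looks pretty straightforward:

  • You look for a line that starts with Comment. This will contain your ID.
  • Once you have an ID, you'll have an attempt line followed by a note line. Read the attempt line; the line after it contains the note.
  • When you reach the next Comment, rinse and repeat.

We have a particular structure: each ID has an array of attempts, and each attempt contains an outcome and a note.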
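That shape can also be sketched with plain Perl data, without any classes: a list of records, each holding an id and an array of attempts. This is only an illustration under some assumptions: the sub name `parse_records` and the hash keys are invented here, and the code assumes every attempt line is immediately followed by its note line.

```perl
use strict;
use warnings;

# Hypothetical helper: read lines from a filehandle and return a list of
# records shaped like
#   { id => 'id1', attempts => [ { id => ..., outcome => ..., note => ... } ] }
sub parse_records {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\s+$//;    # drop the newline and any trailing spaces
        if ($line =~ /^Comment for the record "([^"]*)"/) {
            push @records, { id => $1, attempts => [] };
        }
        elsif (@records and $line =~ /^(\S+) made on .* outcome (.+)$/) {
            my ($attempt_id, $outcome) = ($1, $2);
            my $note = <$fh>;    # assumption: the note is the very next line
            $note = '' unless defined $note;
            $note =~ s/\s+$//;
            push @{ $records[-1]{attempts} },
                { id => $attempt_id, outcome => $outcome, note => $note };
        }
    }
    return @records;
}

# Sample input taken from the question, read from an in-memory filehandle.
my $text = <<'END';
Comment for the record "id1"
Attempt1 made on [time] outcome [outcome]
note 1

Comment for the record "id2"
Attempt1 made on [time] outcome [outcome]
note 1
Attempt2 made on [time] outcome [outcome]
note 2
END

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
for my $record (parse_records($fh)) {
    for my $attempt (@{ $record->{attempts} }) {
        # Columns: record id, attempt id, note, outcome (tab-separated).
        print join("\t", $record->{id}, @{$attempt}{qw(id note outcome)}), "\n";
    }
}
close $fh;
```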

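I'm going to use object-oriented Perl here. I'll put all the record IDs into a list called @dataList, and each item in this list is of type Id.

Each Id will contain an array of attempts, and each Attempt will have an ID, a time, an outcome, and a note.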

#! /usr/bin/perl 
# test.pl 

use strict; 
use warnings; 
use feature qw(say); 

######################################################################## 
# READ IN AND PARSE YOUR DATA 
# 

my @dataList; 

my $record; 
while (my $line = <DATA>) { 
    chomp $line; 
    if ($line =~ /^Comment for the record "(.*)"/) { 
     my $id = $1; 
     $record = Id->new($id); 
     push @dataList, $record; 
    } 
    elsif ($line =~ /^(\S+)\s+made on\s(\S+)\soutcome\s(.*)/) { 
     my $attemptId = $1; 
     my $time = $2; 
     my $outcome = $3; 

     # Next line is the note 

     chomp (my $note = <DATA>); 
     my $attempt = Attempt->new($attemptId, $time, $outcome, $note); 
     $record->PushAttempt($attempt); 
    } 
} 

foreach my $id (@dataList) { 
    foreach my $attempt ($id->Attempt) { 
     print $id->Id . "\t"; 
     print $attempt->Id . "\t"; 
     print $attempt->Note . "\t"; 
     print $attempt->Outcome . "\n"; 
    } 
} 
# 
######################################################################## 


######################################################################## 
# PACKAGE Id; 
# 
package Id; 
use Carp; 

sub new { 
    my $class  = shift; 
    my $id = shift; 

    my $self = {}; 

    bless $self, $class; 

    $self->Id($id); 

    return $self; 
} 

sub Id { 
    my $self = shift; 
    my $id = shift; 

    if (defined $id) { 
     $self->{ID} = $id; 
    } 

    return $self->{ID}; 
} 

sub PushAttempt { 
    my $self  = shift; 
    my $attempt = shift; 

    if (not defined $attempt) { 
     croak qq(Missing Attempt in call to Id->PushAttempt); 
    } 
    if (not exists ${$self}{ATTEMPT}) { 
     $self->{ATTEMPT} = []; 
    } 
    push @{$self->{ATTEMPT}}, $attempt; 

    return $attempt; 
} 

sub PopAttempt { 
    my $self = shift; 

    return pop @{$self->{ATTEMPT}}; 
} 

sub Attempt { 
    my $self = shift; 
    # Return an empty list when no attempts have been pushed yet.
    return @{ $self->{ATTEMPT} || [] }; 
} 


# 
######################################################################## 

######################################################################## 
# PACKAGE Attempt 
# 
package Attempt; 

sub new { 
    my $class  = shift; 
    my $id = shift; 
    my $time  = shift; 
    # Order matches the call: Attempt->new($attemptId, $time, $outcome, $note)
    my $outcome = shift; 
    my $note  = shift; 

    my $self = {}; 
    bless $self, $class; 

    $self->Id($id); 
    $self->Time($time); 
    $self->Note($note); 
    $self->Outcome($outcome); 

    return $self; 
} 

sub Id { 
    my $self = shift; 
    my $id = shift; 


    if (defined $id) { 
     $self->{ID} = $id; 
    } 

    return $self->{ID}; 
} 

sub Time { 
    my $self = shift; 
    my $time = shift; 

    if (defined $time) { 
     $self->{TIME} = $time; 
    } 

    return $self->{TIME}; 
} 

sub Note { 
    my $self = shift; 
    my $note = shift; 

    if (defined $note) { 
     $self->{NOTE} = $note; 
    } 

    return $self->{NOTE}; 
} 

sub Outcome { 
    my $self  = shift; 
    my $outcome = shift; 

    if (defined $outcome) { 
     $self->{OUTCOME} = $outcome; 
    } 

    return $self->{OUTCOME}; 
} 
# 
######################################################################## 

package main; 

__DATA__ 
Comment for the record "id1" 
Attempt1 made on [time] outcome [outcome11] 
note 11 

Comment for the record "id2" 
Attempt21 made on [time] outcome [outcome21] 
note 21 
Attempt22 made on [time] outcome [outcome22] 
note 22 

Comment for the record "id3" 
Attempt31 made on [time] outcome [outcome31] 
note 31 
Attempt32 made on [time] outcome [outcome32] 
note 32 
Attempt33 made on [time] outcome [outcome33] 
note 33 
Attempt34 made on [time] outcome [outcome34] 
note 34 
0

This may not be very robust, but here's a fun attempt with sed:

sed -r -n 's/Comment for the record "([^"]+)"$/\1/;tgo;bnormal;:go {h;n;};:normal /^Attempt[0-9]/{s/(.+) made on .* outcome (.+)$/\1 \2/;G;s/\n/ /;s/(.+) (.+) (.+)/\3\t\1\t\2/;N;s/\t([^\t]+)\n(.+)/\t\2\t\1/;p;d;}' data.txt 

Note: GNU sed only. Portability is easy to achieve if needed.

2

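Your data fits well with a paragraph-oriented parsing strategy. Because your specification is vague, it's hard to know exactly which regexes are needed, but this should illustrate the general approach, based on your example: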

use strict; 
use warnings; 

# Paragraph mode: read the input file a paragraph/block at a time. 
local $/ = ""; 

while (my $block = <>){ 
    # Convert the block to lines. 
    my @lines = grep /\S/, split("\n", $block); 

    # Parse the text, capturing the needed items from @lines as we consume it. 
    # Note also the technique of assigning regex captures directly to variables. 
    my ($id) = shift(@lines) =~ /"(.+)"/; 
    while (@lines){ 
     my ($attempt, $outcome) = shift(@lines) =~ /(Attempt\d+).+outcome (\d+)/; 
     my $note = shift @lines; 
     print join("\t", $id, $attempt, $note, $outcome), "\n"; 
    } 
} 
+1

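Setting `$/ = "\n\n"` means exactly two newlines, whereas the recommended setting `$/ = ""` means **two or more newlines**, so it doesn't matter how many blank lines there are: each record always starts with real data. – tchrist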

+0

@tchrist Didn't know that. Thanks for the tip. – FMc
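The difference tchrist describes can be seen with a small experiment. The sketch below is an illustration only: the helper name `read_records` and the sample text are invented here. It reads the same string twice, once with a literal two-newline separator and once in paragraph mode, where the records are separated by two blank lines rather than one.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into records using the given input
# record separator, then strip trailing newlines from each record.
sub read_records {
    my ($text, $separator) = @_;
    local $/ = $separator;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    s/\n+\z// for @records;
    return @records;
}

# Two records separated by three newlines (two blank lines).
my $text = "Comment A\nnote 1\n\n\nComment B\nnote 2\n";

for my $separator ("\n\n", "") {
    my $label = $separator eq "" ? 'paragraph mode ("")' : 'literal "\n\n"';
    print "$label:\n";
    for my $record (read_records($text, $separator)) {
        (my $shown = $record) =~ s/\n/\\n/g;    # make newlines visible
        print "  [$shown]\n";
    }
}
```

With the literal separator, the second record should come back with a stray leading newline, because the third newline is left over after the first record ends. In paragraph mode the extra blank line is absorbed, so both records start with real data.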

0

An awk one-liner:

kent$ awk 'NF==5{gsub(/\"/,"",$5);id=$5;next;} /^Attempt/{n=$1;gsub(/Attempt/,"Note",n);print id,$1,n,$6}' input      
id1 Attempt1 Note1 [outcome] 
id2 Attempt1 Note1 [outcome] 
id2 Attempt2 Note2 [outcome] 
id3 Attempt1 Note1 [outcome] 
id3 Attempt2 Note2 [outcome] 
id3 Attempt3 Note3 [outcome] 
id3 Attempt4 Note4 [outcome] 