Perl: finding valid pairs of lines in different cases

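I have HTTP header request and reply data in tab-separated form, with each GET/POST and each reply on a separate line. The data is such that a single TCP stream can contain multiple GETs, POSTs and REPLYs. I need to select only the first valid GET - REPLY pair from these cases. A (simplified) example: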

ID  Source Dest Bytes Type Content-Length host    lines.... 
1   A   B  10  GET  NA   yahoo.com   2 
1   A   B  10  REPLY  10   NA     2 
2   C   D  40  GET  NA   google.com   4 
2   C   D  40  REPLY  20   NA     4 
2   C   D  40  GET  NA   google.com   4 
2   C   D  40  REPLY  30   NA     4 
3   A   B  250 POST  NA   mail.yahoo.com  5 
3   A   B  250 REPLY  NA   NA     5 
3   A   B  250 REPLY  15   NA     5 
3   A   B  250 GET  NA   yimg.com    5 
3   A   B  250 REPLY  35   NA     5 
4   G   H  415 REPLY  10   NA     6 
4   G   H  415 POST  NA   facebook.com   6 
4   G   H  415 REPLY  NA   NA     6 
4   G   H  415 REPLY  NA   NA     6 
4   G   H  415 GET  NA   photos.facebook.com 6 
4   G   H  415 REPLY  50   NA     6 

....

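So basically I need one request - reply pair for each ID, written to a new file.

'1' is just a single pair, so that is easy. But there are also false cases where both lines are GET, POST or REPLY. Such cases are ignored.

For '2', I would select the first GET - REPLY pair.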

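For '3', I would select the first request (the POST) but the second REPLY, since Content-Length is not present in the first one (making the later REPLY a better candidate).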

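For '4', I would select the first POST (or GET), since the first header cannot be a REPLY. Even though Content-Length is missing in the replies after the POST, I would not select the REPLY after the second GET, because that REPLY comes later. So I would just select the first REPLY.

So, after selecting the best request and reply pair, I need to combine them on a single line. For the example, the output would be: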

ID  Source Dest Bytes Type Content-Length host   .... 
    1   A   B  10  GET  10   yahoo.com 
    2   C   D  40  GET  20   google.com 
    3   A   B  250 POST  15   mail.yahoo.com 
    4   G   H  415 POST  NA   facebook.com

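There are many other headers in the actual data, but this example shows pretty much what I need. How does one do this in Perl? I am pretty much stuck at the beginning, because I can only read the file one line at a time.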

my $file = "file.txt";
open my $fh, "<", $file or die "Cannot open $file: $!";

while (<$fh>) {
    chomp;
    my @line = split /\t/;

    # get the valid pairs for cases with multiple request - replies

    # get the paired up data together

}
close $fh;
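One way past the "one line at a time" hurdle is to collect the rows of each ID first and decide on the pair afterwards. The following is only a minimal sketch under that assumption: `group_by_id` is a hypothetical helper, the sample lines are copied from the question, and the check for at least seven tab-separated fields is an assumption about the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the rows of each ID, keeping first-seen ID order.
# Takes a list of tab-separated lines; returns (\@order, \%rows_by_id).
sub group_by_id {
    my @lines = @_;
    my (@order, %rows);
    for my $line (@lines) {
        chomp $line;
        my @f = split /\t/, $line;
        # Skip the header line, blank lines and "...." filler
        next unless @f >= 7 and $f[0] =~ /^\d+$/;
        push @order, $f[0] unless $rows{ $f[0] };
        push @{ $rows{ $f[0] } }, \@f;
    }
    return (\@order, \%rows);
}

my @sample = (
    "ID\tSource\tDest\tBytes\tType\tContent-Length\thost\tlines",
    "1\tA\tB\t10\tGET\tNA\tyahoo.com\t2",
    "1\tA\tB\t10\tREPLY\t10\tNA\t2",
    "2\tC\tD\t40\tGET\tNA\tgoogle.com\t4",
);
my ($order, $rows) = group_by_id(@sample);
printf "ID %s has %d line(s)\n", $_, scalar @{ $rows->{$_} } for @$order;
```

Holding every row in memory is the simple choice; for very large captures, processing each group as soon as the ID changes would use less memory.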

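* Edit: I have added an additional column giving the number of HTTP header lines for each ID. This may help in knowing how many subsequent lines to check. Also, I have modified ID '4' so that its first header line is a REPLY. *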
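As a rough illustration of how that extra column could be used, the sketch below cuts a list of rows into per-ID groups by trusting the last field as the group size. `take_groups` is a hypothetical helper, and the fallback to a single row when the count looks wrong is an assumption, not something the question specifies.

```perl
use strict;
use warnings;

# Hypothetical helper: cut a list of rows (array refs) into per-ID groups,
# trusting the last column to hold the number of header lines for that ID.
sub take_groups {
    my @rows = @_;
    my @groups;
    while (@rows) {
        my $n = $rows[0][-1];
        # Fall back to a single row if the count is missing or implausible
        $n = 1 unless defined $n and $n =~ /^\d+$/ and $n >= 1;
        $n = @rows if $n > @rows;
        push @groups, [ splice @rows, 0, $n ];
    }
    return @groups;
}

my @rows = map { [ split ' ' ] } (
    "1 A B 10 GET NA yahoo.com 2",
    "1 A B 10 REPLY 10 NA 2",
    "2 C D 40 GET NA google.com 4",
    "2 C D 40 REPLY 20 NA 4",
    "2 C D 40 GET NA google.com 4",
    "2 C D 40 REPLY 30 NA 4",
);
printf "ID %s: %d line(s)\n", $_->[0][0], scalar @$_ for take_groups(@rows);
```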

Source

2012-04-29 sfactor

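+1. Thanks! –

Is the ID sufficient to identify the group of lines to be processed? If so, can we assume that the source and destination are the same within an ID? –

@JonathanLeffler Yes, that is sufficient, since it denotes one TCP stream with the same source and destination, ports, etc. So I need one request - reply pair for each ID, as shown. – sfactor

The program below does what I think you need.

It is commented, and I think it is fairly clear. Please ask if anything is unclear.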

use strict; 
use warnings; 

use List::Util 'max'; 

my $file = $ARGV[0] // 'file.txt'; 
open my $fh, '<', $file or die qq(Unable to open "$file" for reading: $!); 

# Read the field names from the first line to index the hashes 
# Remember where the data in the file starts so we can get back here 
# 
my @fields = split ' ', <$fh>; 
my $start = tell $fh; 

# Build a format to print the accumulated data 
# Create a hash that relates column headers to their widths 
# 
my @headers = qw/ ID Source Dest Bytes Type Content-Length host /; 
my %len = map { $_ => length } @headers; 

# Read through the file to find the maximum data width for each column 
# 
while (<$fh>) { 
    my %data; 
    @data{@fields} = split; 
    next unless $data{ID} =~ /^\d/; 
    $len{$_} = max($len{$_}, length $data{$_}) for @headers; 
} 

# Build a format string using the values calculated 
# 
my $format = join ' ', map sprintf('%%%ds', $_), @len{@headers}; 
$format .= "\n"; 

# Go back to the start of the data 
# Print the column headers 
# 
seek $fh, $start, 0; 
printf $format, @headers; 

# Build transaction data hashes into $record and print them 
# Ignore any events before the first request 
# Ignore the second request and anything after it 
# Update the stored Content-Length field if a value other than NA appears 
# 
my $record; 
my $nreq = 0; 

while (<$fh>) { 

    my %data; 
    @data{@fields} = split; 
    my ($id, $type) = @data{ qw/ ID Type/}; 
    next unless $id =~ /^\d/; 

    if ($record and $id ne $record->{ID}) {
        printf $format, @{$record}{@headers};
        undef $record;
        $nreq = 0;
    }

    if ($type eq 'GET' or $type eq 'POST') {
        $record = \%data if $nreq == 0;
        $nreq++;
    }
    elsif ($nreq == 1) {
        if ($record->{'Content-Length'} eq 'NA' and $data{'Content-Length'} ne 'NA') {
            $record->{'Content-Length'} = $data{'Content-Length'};
        }
    }
} 

printf $format, @{$record}{@headers} if $record;

Output

With the data given in the question, this program produces

ID Source Dest Bytes Type Content-Length     host 
1  A  B  10  GET    10    yahoo.com 
2  C  D  40  GET    20   google.com 
3  A  B  250 POST    15  mail.yahoo.com 
4  G  H  415 POST    NA   facebook.com
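The least obvious step in the program above is probably the way it sizes the columns: it records the widest value per column, then turns those widths into a printf format. Here is a small sketch of that step in isolation. `build_format` is a hypothetical helper and the rows are held in memory rather than re-read with seek, which is a simplification of what the answer does.

```perl
use strict;
use warnings;
use List::Util 'max';

# Hypothetical helper: build a printf format wide enough for every row.
# $headers is an array ref of column names, $rows an array ref of hash refs.
sub build_format {
    my ($headers, $rows) = @_;
    # Start from the width of each column heading
    my %len = map { $_ => length } @$headers;
    for my $row (@$rows) {
        $len{$_} = max($len{$_}, length $row->{$_}) for @$headers;
    }
    # '%%%ds' with a width of 14 yields '%14s' (right-justified)
    return join(' ', map { sprintf('%%%ds', $_) } @len{@$headers}) . "\n";
}

my @headers = qw/ ID Type host /;
my @rows = (
    { ID => 1, Type => 'GET',  host => 'yahoo.com' },
    { ID => 3, Type => 'POST', host => 'mail.yahoo.com' },
);
my $format = build_format(\@headers, \@rows);
printf $format, @headers;
printf $format, @{$_}{@headers} for @rows;
```

Because the widths are taken over every data line, including REPLY lines, a column can end up wider than any value that is actually printed.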

Source

2012-05-02 14:33:25 Borodin

This seems to work on the given data:

#!/usr/bin/env perl 
use strict; 
use warnings; 

# Shape of input records 
use constant ID  => 0; 
use constant Source => 1; 
use constant Dest  => 2; 
use constant Bytes => 3; 
use constant Type  => 4; 
use constant Length => 5; 
use constant Host  => 6; 

use constant fmt_head => "%-6s %-6s %-6s %-6s %-6s %-6s %s\n"; 
use constant fmt_data => "%-6d %-6s %-6s % 6d %-6s % 6s %s\n"; 

printf fmt_head, "ID", "Source", "Dest", "Bytes", "Type", "Length", "Host"; 

my @post_get; 
my @reply; 
my $lastid = -1; 
my $pg_count = 0; 

sub print_data 
{ 
    # Final validity checking 
    if ($lastid != -1) 
    { 
     printf fmt_data, $post_get[ID], $post_get[Source], 
       $post_get[Dest], $post_get[Bytes], $post_get[Type], $reply[Length], $post_get[Host]; 
     # Reset arrays; 
     @post_get =(); 
     @reply =(); 
     $pg_count = 0; 
    } 
} 

while (<>) 
{ 
    chomp; 
    my @record = split; 
    # Validate record: skip the header line, blank lines and "...." filler 
    next unless @record > Host && $record[ID] =~ /^\d+$/; 
    # Detect change in ID 
    print_data if ($record[ID] != $lastid); 
    $lastid = $record[ID]; 

    if ($record[Type] eq "REPLY") 
    { 
     # Discard REPLY if there wasn't already a POST/GET 
     next unless defined $post_get[ID]; 
     # Discard REPLY if there was a second POST/GET 
     next if $pg_count > 1; 
     @reply = @record if !defined $reply[ID]; 
     $reply[Length] = $record[Length] 
         if $reply[Length] eq "NA" && $record[Length] ne "NA"; 
    } 
    else 
    { 
     $pg_count++; 
     @post_get = @record if !defined $post_get[ID]; 
     $post_get[Length] = $record[Length] 
          if $post_get[Length] eq "NA" && $record[Length] ne "NA"; 
    } 
} 
print_data;

It produces:

ID     Source Dest   Bytes  Type   Length Host
1      A      B          10 GET        10 yahoo.com
2      C      D          40 GET        20 google.com
3      A      B         250 POST       15 mail.yahoo.com
4      G      H         415 POST       NA facebook.com

The main deviation from the question is 'Length' in place of 'Content-Length'; if needed, the fix is easy: change the sixth width in fmt_data and fmt_head to 14, and change "Length" to "Content-Length".
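Applied to the two format constants, that change might look like the sketch below. Only the sixth conversion and the heading text differ from the program above; the sample data row is taken from the question.

```perl
use strict;
use warnings;

# Sixth column widened from 6 to 14 so that "Content-Length" fits
use constant fmt_head => "%-6s %-6s %-6s %-6s %-6s %-14s %s\n";
use constant fmt_data => "%-6d %-6s %-6s % 6d %-6s % 14s %s\n";

printf fmt_head, "ID", "Source", "Dest", "Bytes", "Type", "Content-Length", "Host";
printf fmt_data, 1, "A", "B", 10, "GET", 10, "yahoo.com";
```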

Source

2012-04-29 16:58:48

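Using global variables in 'print_data' and relying on it to reset those globals is perhaps not the best idea. Use references instead, and clear the arrays in the main loop. Also, 'chomp' is not needed when splitting on whitespace. However, respecting the tab-delimited format and using 'chomp' + 'split /\t/' would be the better option, IMO. – TLP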
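A minimal sketch of what this comment suggests, assuming the same field layout as the answer: `format_pair` is a hypothetical replacement for print_data that receives array references and returns the formatted line, so the caller decides when to print and when to reset its own arrays.

```perl
use strict;
use warnings;

use constant { ID => 0, Source => 1, Dest => 2, Bytes => 3,
               Type => 4, Length => 5, Host => 6 };

# Hypothetical variant of print_data: takes array references and returns the
# formatted line instead of reading and resetting file-scoped variables.
sub format_pair {
    my ($request, $reply) = @_;
    return sprintf "%-6d %-6s %-6s % 6d %-6s % 6s %s\n",
        @{$request}[ID, Source, Dest, Bytes, Type],
        $reply->[Length], $request->[Host];
}

my $line = "1\tA\tB\t10\tGET\tNA\tyahoo.com\t2\n";
chomp $line;                                 # needed once we split on tabs only
my @request = split /\t/, $line;
my @reply   = split /\t/, "1\tA\tB\t10\tREPLY\t10\tNA\t2";

print format_pair(\@request, \@reply);
@request = @reply = ();                      # the caller resets its own state
```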

Also, using an array slice, 'printf fmt_data, @post_get[ID, Source, Dest, Bytes, Type], $reply[Length], $post_get[Host]', is more readable. – TLP
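For readers unfamiliar with slices, this short sketch shows that the slice yields the same list as naming each element. The constants mirror the answer's; the data row is from the question.

```perl
use strict;
use warnings;

use constant { ID => 0, Source => 1, Dest => 2, Bytes => 3, Type => 4 };

my @post_get = (2, "C", "D", 40, "GET", "NA", "google.com");

# Element-by-element list and the equivalent array slice
my @long  = ($post_get[ID], $post_get[Source], $post_get[Dest],
             $post_get[Bytes], $post_get[Type]);
my @short = @post_get[ID, Source, Dest, Bytes, Type];

print "same\n" if "@long" eq "@short";
```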

@Jonathan Leffler: Using an array with what is effectively an 'enum' of indices seems perverse, rather than a simple hash. – Borodin
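To make the contrast concrete, here is a sketch of the same record held in a hash, so that fields are looked up by name instead of through index constants. `to_record` and the `@names` list are hypothetical; the field order is assumed to match the answer's constants.

```perl
use strict;
use warnings;

my @names = qw/ ID Source Dest Bytes Type Length Host /;

# Hypothetical helper: the same record as a hash, so fields are looked up by
# name rather than through index constants.
sub to_record {
    my ($line) = @_;
    my %record;
    @record{@names} = split ' ', $line;   # extra trailing fields are ignored
    return \%record;
}

my $reply = to_record("3 A B 250 REPLY 15 NA");
print "reply carries a length\n"
    if $reply->{Type} eq 'REPLY' and $reply->{Length} ne 'NA';
```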
