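How can I extract a date from this HTML table?

I'm trying to use a regular expression to get the date from the second cell of the table, but it doesn't match and I can't figure out why. How can I extract the date from this HTML table?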

my $str = '"  
    <td class="fieldLabel" height="18">Activation Date:</td> 
    <td class="dataEntry" height="18"> 
     10/27/2011  
    </td>'; 

if ($str =~ /Activation Date.*<td.*>(.*)</gm) { 
    print "matched: ".$1; 
}else{ 
    print "mismatched!"; 
}

Asked by user1187968, 2012-03-08

[The pony, he comes...](http://stackoverflow.com/a/1732454/554546) – 2012-03-08 18:39:20

In any case, see [tchrist's response](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string) – JRFerguson 2012-03-08 19:03:11

@JFFerguson: I think I made a cameo appearance there too :-) – 2012-03-08 20:45:48

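Others have already pointed out that you want the /s option so that . matches newlines, which lets .* cross logical line boundaries. You probably also want the non-greedy .*? so that each wildcard stops at the first place the rest of the pattern can match: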

use v5.10;
use strict;
use warnings;

my $html = <<'HTML';
    <td class="fieldLabel" height="18">Activation Date:</td>
    <td class="dataEntry" height="18">
     10/27/2011
    </td>
HTML

# /x allows whitespace inside the pattern; /s lets . match newlines
my $regex = qr|
    <td.*?>Activation \s+ Date:</td>
     \s*
    <td.*?class="dataEntry".*?>\s*
     (\S+)
    \s*</td>
    |xs;

if ($html =~ $regex) {
    say "matched: $1";
    }
else {
    say "mismatched!";
    }

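If you have the complete table, it's easier to use something that already knows how to parse tables. Let a module such as HTML::TableParser handle all the details: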

use v5.10;
use strict;
use warnings;

my $html = <<'HTML';
    <table>
    <tr>
    <td class="fieldLabel" height="18">Activation Date:</td>
    <td class="dataEntry" height="18">
     10/27/2011
    </td>
    </tr>
    </table>
HTML

use HTML::TableParser;

# called for each data row; with Trim on, cell text has no surrounding whitespace
sub row {
    my($tbl_id, $line_no, $data, $udata) = @_;
    return unless $data->[0] eq 'Activation Date:';   # the label cell includes the colon
    say "Date is $data->[1]";
    }

# create parser object: an array ref of table requests, then the parser options
my $p = HTML::TableParser->new(
    [ { id => 1, row => \&row, } ],
    { Decode => 1, Trim => 1, Chomp => 1, },
    );
$p->parse($html);

There's also HTML::TableExtract:

use v5.10;
use strict;
use warnings;

my $html = <<'HTML';
    <table>
    <tr>
    <td class="fieldLabel" height="18">Activation Date:</td>
    <td class="dataEntry" height="18">
     10/27/2011
    </td>
    </tr>
    </table>
HTML

use HTML::TableExtract;

my $p = HTML::TableExtract->new;
$p->parse($html);
my $table_tree = $p->first_table_found;
my $date = $table_tree->cell(0, 1);   # row 0, column 1
$date =~ s/\A\s+|\s+\z//g;            # trim surrounding whitespace
say "Date is $date";

Answered 2012-03-08 20:52:55

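You may be misunderstanding the regex flags.

/m means you want to match against multiple lines: ^ can also match at the start of each line and $ at the end of each line.
/s means you want to treat a multi-line string as a single line by allowing . to match any character, including a newline. Normally, . matches any character except a newline.

If you add the /s flag, your regex should work, although you really shouldn't parse HTML with regex anyway.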
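A small sketch of the difference, using a made-up two-line string (`$text` is purely illustrative): /m changes where `^` and `$` may match, while /s changes what `.` may match.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, ^ matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;
# With /m, ^ also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, . will not cross the newline, so this fails.
my $dot_plain  = $text =~ /first.*second/  ? 1 : 0;
# With /s, . matches the newline as well, so this succeeds.
my $dot_single = $text =~ /first.*second/s ? 1 : 0;

print "without /m: @plain\n";
print "with /m:    @multi\n";
print "without /s: $dot_plain, with /s: $dot_single\n";
```

The two flags are independent and can be combined when a pattern needs both behaviors.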
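As a rough check, here is the question's pattern with /s in place of /m, run against the same fragment (rewritten as a heredoc for readability). Because the capture is greedy, it includes the whitespace around the date, so this sketch trims it afterwards. It depends on this exact markup and is not a robust way to handle HTML.

```perl
use strict;
use warnings;

my $str = <<'HTML';
    <td class="fieldLabel" height="18">Activation Date:</td>
    <td class="dataEntry" height="18">
     10/27/2011
    </td>
HTML

my $date;
# Same pattern as in the question, but with /s so that . can cross newlines.
if ($str =~ /Activation Date.*<td.*>(.*)</s) {
    $date = $1;
    $date =~ s/\A\s+|\s+\z//g;   # the capture carries surrounding whitespace
}

print defined $date ? "matched: $date\n" : "mismatched!\n";
```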

Answered 2012-03-08 18:31:56
