2012-03-08 11 views
0

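I'm trying to use a regular expression to get the date from the second cell of a table, but it doesn't match and I really can't figure out why. How can I extract the date from this HTML table?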

my $str = '"  
    <td class="fieldLabel" height="18">Activation Date:</td> 
    <td class="dataEntry" height="18"> 
     10/27/2011  
    </td>'; 

if ($str =~ /Activation Date.*<td.*>(.*)</gm) { 
    print "matched: ".$1; 
}else{ 
    print "mismatched!"; 
} 
+4

[The pony, he comes...](http://stackoverflow.com/a/1732454/554546) – 2012-03-08 18:39:20

+2

In short, see [tchrist's response](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string) – JRFerguson 2012-03-08 19:03:11

+0

@JRFerguson: I think I make a cameo appearance there too :-) – 2012-03-08 20:45:48

Answers

4

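Others have already pointed out that you want the /s option so that . matches newlines, which lets .* cross logical line boundaries. You probably also want the non-greedy .*? so the match stops at the first cell that fits instead of running on to the last one: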

use v5.10; 

my $html = <<'HTML';  
    <td class="fieldLabel" height="18">Activation Date:</td> 
    <td class="dataEntry" height="18"> 
     10/27/2011  
    </td> 
HTML 

my $regex = qr| 
    <td.*?>Activation \s+ Date:</td> 
     \s* 
    <td.*?class="dataEntry".*?>\s* 
     (\S+) 
    \s*</td> 
    |xs; 

if ($html =~ $regex) { 
    say "matched: $1"; 
    } 
else { 
    say "mismatched!"; 
    } 
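To see why the non-greedy quantifier matters, here is a minimal sketch. It assumes a second, made-up "Expiry Date" row that is not in the question's HTML; it is there only so that greedy and non-greedy matching give different answers.

```perl
use strict;
use warnings;

# Hypothetical two-row fragment; the "Expiry Date" row is an assumption
# added for illustration only.
my $html = <<'HTML';
<td class="fieldLabel">Activation Date:</td>
<td class="dataEntry">10/27/2011</td>
<td class="fieldLabel">Expiry Date:</td>
<td class="dataEntry">10/27/2012</td>
HTML

# Greedy: .* runs to the end of the string and then backtracks, so the
# capture comes from the LAST cell that lets the pattern match.
my ($greedy) = $html =~ m{Activation Date:.*<td[^>]*>([^<]*)</td>}s;

# Non-greedy: .*? grows only as far as needed, so the capture comes
# from the first cell after the label.
my ($lazy) = $html =~ m{Activation Date:.*?<td[^>]*>([^<]*)</td>}s;

print "greedy: $greedy\n";   # greedy: 10/27/2012
print "lazy:   $lazy\n";     # lazy:   10/27/2011
```

With only one data cell in the string both forms capture the same text, which is why the problem tends to show up only once the pattern is run against the full page.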

If you have the complete table, it's easier to use something that knows how to parse tables. Let a module such as HTML::TableParser handle all the details:

use v5.10; 

my $html = <<'HTML'; 
    <table> 
    <tr> 
    <td class="fieldLabel" height="18">Activation Date:</td> 
    <td class="dataEntry" height="18"> 
     10/27/2011  
    </td> 
    </tr> 
    </table> 
HTML 

use HTML::TableParser; 

sub row { 
    my($tbl_id, $line_no, $data, $udata) = @_; 
    return unless $data->[0] eq 'Activation Date:';   # the label cell includes the colon
    say "Date is $data->[1]"; 
    } 

# create parser object 
my $p = HTML::TableParser->new( 
    [ { id => 1, row => \&row } ],            # table requests, as an array ref 
    { Decode => 1, Trim => 1, Chomp => 1 },   # parser attributes 
    ); 
$p->parse($html); 

There's also HTML::TableExtract:

use v5.10; 

my $html = <<'HTML'; 
    <table> 
    <tr> 
    <td class="fieldLabel" height="18">Activation Date:</td> 
    <td class="dataEntry" height="18"> 
     10/27/2011  
    </td> 
    </tr> 
    </table> 
HTML 

use HTML::TableExtract; 

my $p = HTML::TableExtract->new; 
$p->parse($html); 
my $table_tree = $p->first_table_found; 
my $date = $table_tree->cell(0, 1); 
$date =~ s/\A\s+|\s+\z//g; 
say "Date is $date"; 
3

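You may be misunderstanding the regex flags.

  • /m means you want to match against multiple lines: ^ can match at the start of any line and $ can match at the end of any line, not just at the start and end of the whole string.
  • /s means you want to treat a multi-line string as a single line: . matches any character, including newline. Normally, . matches any character except newline.

If you add the /s flag, your regex should work, although you really shouldn't parse HTML with regex anyway.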
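
A short sketch of the difference between the two flags, followed by the question's pattern with /s in place of /gm. The variable names and the sample text are chosen here for illustration, and the question's string is abbreviated to the two cells.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /s, . stops at a newline, so .* cannot span both lines.
my $plain  = $text =~ /first.*second/  ? 1 : 0;
# With /s, . also matches "\n".
my $dotall = $text =~ /first.*second/s ? 1 : 0;

# Without /m, ^ anchors only at the very start of the string.
my $no_m   = $text =~ /^second/  ? 1 : 0;
# With /m, ^ also matches just after each embedded newline.
my $with_m = $text =~ /^second/m ? 1 : 0;

print "plain=$plain dotall=$dotall no_m=$no_m with_m=$with_m\n";
# plain=0 dotall=1 no_m=0 with_m=1

# The question's pattern, with /s instead of /gm.
my $str = qq{<td class="fieldLabel" height="18">Activation Date:</td>\n}
        . qq{<td class="dataEntry" height="18">\n 10/27/2011\n</td>};

my ($date) = $str =~ /Activation Date.*<td.*>(.*)</s;

# The capture still carries the whitespace around the date, so trim it.
$date =~ s/\A\s+|\s+\z//g;
print "matched: $date\n";   # matched: 10/27/2011
```

The /g flag from the question is dropped because only a single match is wanted.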