2016-02-26 62 views
-4

Global regex match hangs

I have the following Perl code:

# $content is the text of a webpage 
while ($content =~ /rgRow.*?<td>(.*?)<\/td><td.*?>(.*?)<\/td><td.*?>(.*?)<\/td><td.*?>.*?<\/td><td.*?>(.*?)<\/td><td.*?><nobr>(.*?)<\/nobr><\/td>/sg) { 
    # do stuff 
} 

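I have determined that the code hangs on this regex match. It gets through 2-3 iterations of the while loop and then hangs. I left it running for about 30 minutes and it never continued.

What could be the problem?

The purpose of the code is to go through some HTML and extract some data from it.

Here is the HTML that I set $content to: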

<tbody> 
     <tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__0"> 
      <td>CONSIDERATION OF REPORTS SUBMITTED BY STATES PARTIES UNDER ARTICLE 9 OF THE CONVENTION : SECOND PERIODIC REPORT OF STATES PARTIES DUE IN 1974/MOROCCO</td><td>State party's report</td><td>CERD</td><td>Morocco</td><td>CERD/C/R.65/Add.1</td><td><nobr>21 Feb 1974</nobr></td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl04_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=CERD%2fC%2fR.65%2fAdd.1&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">CERD/C/R.65/Add.1</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__1"> 
      <td>CONSIDERATION OF REPORTS SUBMITTED BY STATES PARTIES UNDER ARTICLE 9 OF THE CONVENTION : INITIAL REPORTS OF STATES PARTIES WHICH ARE DUE IN 1972/MOROCCO</td><td>State party's report</td><td>CERD</td><td>Morocco</td><td>CERD/C/R.33/Add.1</td><td><nobr>17 Jan 1972</nobr></td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl06_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=CERD%2fC%2fR.33%2fAdd.1&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">CERD/C/R.33/Add.1</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__2"> 
      <td>Annex I to ALGERIA's Report</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl08_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fAIS%2fDZA%2f13691&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_AIS_DZA_13691_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/AIS/DZA/13691</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__3"> 
      <td>Annex II to ALGERIA's report</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl10_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fAIS%2fDZA%2f13692&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_AIS_DZA_13692_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/AIS/DZA/13692</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__4"> 
      <td>Annex III to ALGERIA's report</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl12_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fAIS%2fDZA%2f13693&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_AIS_DZA_13693_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/AIS/DZA/13693</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__5"> 
      <td>CERD-C-NZ-18-20_Annexes</td><td>Annex to State party report</td><td>CERD</td><td>New Zealand</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl14_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fNZL%2f13731&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_NZL_13731_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/NZL/13731</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__6"> 
      <td>CERD.C.RUS.20-22_Annex1</td><td>Annex to State party report</td><td>CERD</td><td>Russian Federation</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl16_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fRUS%2f13732&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">R</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_RUS_13732_R.doc</td><td style="display:none;">INT/CERD/ADR/RUS/13732</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__7"> 
      <td>Annex to State party report</td><td>Annex to State party report</td><td>CERD</td><td>Poland</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl18_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fPOL%2f15432&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_POL_15432_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/POL/15432</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__8"> 
      <td>Annexe X</td><td>Annex to State party report</td><td>CERD</td><td>Belgium</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl20_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fBEL%2f15561&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">&nbsp;</td><td style="display:none;">F</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_BEL_15561_F.pdf</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/BEL/15561</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__9"> 
      <td>Annexe XI</td><td>Annex to State party report</td><td>CERD</td><td>Belgium</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl22_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fBEL%2f15562&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">&nbsp;</td><td style="display:none;">F</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_BEL_15562_F.pdf</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/BEL/15562</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
</tr> 
</tbody> 

I am going to try the following line instead and see how it goes:

while ($content =~ m/rgRow.+?<td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td>/gs) 

The original code is not mine.

+0

Please show the HTML you are trying to parse. In any case, a regex is not the right tool for parsing HTML; why not use an HTML parser? –

+5

[Required reading for anyone trying to parse XML/HTML with a regex](http://stackoverflow.com/a/1732454/18157). Summary: don't parse HTML/XML with a regex, use a proper parser. –

+0

Agreed with the above, but if you have to do it this way, how about breaking up that nasty pattern with `qr`? It would be much easier to read. – zdim

Answers

0

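I take this question as being about debugging legacy code. (Still, see the parser example at the end.)

The reported problem is that the regex hangs. For me, it exits after the first match. My first suspect is a stray newline; the /s modifier only makes . match a newline. Another suspect is the rgRow phrase being matched explicitly: it is also part of an attribute of the <tr> tag, so it can be matched under .* as well. A conflict? Finally, the regex looks for every cell explicitly while also using the /g modifier. For reference, here is the regex, spread out with /x for readability; the code uses it with the /sg modifiers.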

my $patt = qr/rgRow.*? 
    <td> (.*?)<\/td> 
    <td.*?>(.*?)<\/td> 
    <td.*?>(.*?)<\/td> 
    <td.*?> .*? <\/td> 
    <td.*?>(.*?)<\/td> 
    <td.*?> <nobr>(.*?)<\/nobr> <\/td> 
/xs;   # /s has to be set on the qr itself; add /g where the pattern is applied 

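Picking through the source character by character is unpleasant, and it usually does not work. We can do this instead: remove the newlines, then capture the contents of the <td> tags into an array. That is exactly what the regex was meant to accomplish. (I changed the regex delimiters to avoid confusing the editor's syntax coloring.)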

use warnings; 
use strict; 

my $msg = 'pulled_from_url'; 
(my $msg_nonl = $msg) =~ s%\n%%g; 

# The leading m is required when the delimiter is not / 
my @raw_cells = $msg_nonl =~ m|<td.*?>(.*?)<\/td>|g; 

# While we are at it: strip <nobr>, &nbsp;, drop empty elements 
my @cells = grep { !/^\s*$/ } map { s%<\/?nobr>|&nbsp;%%g; $_ } @raw_cells; 
# Drop the cells that hold links ("View document") as well 
my @content = grep { !/<a.*?\/a>/ } @cells; 
print "Total of " . scalar(@raw_cells) . " cells. "; 
print "Cleaned up, down to " . scalar(@content) . " cells.\n"; 
print "$_\n" for @content; 

This prints the contents of the cells, abridged here for space:

 
Total of 280 cells. Cleaned up, down to 82 cells. 
CONSIDERATION OF REPORTS SUBMITTED BY ... DUE IN 1974/MOROCCO 
State party's report 
... 
21 Feb 1974 
... 
True 
CONSIDERATION OF REPORTS SUBMITTED BY ... DUE IN 1972/MOROCCO 
State party's report 
... 
17 Jan 1972 
... 
True 

By inspection, we can see that the content was pulled correctly from the HTML.
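The flat list above loses the row boundaries. One way to keep them, shown here only as a minimal sketch, is to split the text into rows first and then extract the cells of each row. The two-row `$html` sample and the `cells_by_row` helper are made up for illustration, and the sketch assumes every row is wrapped in `<tr ...>...</tr>` as in the question's HTML.

```perl
use strict;
use warnings;

# Hypothetical two-row sample in the same shape as the question's HTML.
my $html = <<'HTML';
<tr class="rgRow InnerItemStyle">
 <td>Report A</td><td>CERD</td><td><nobr>21 Feb 1974</nobr></td>
</tr><tr class="rgRow InnerAlernatingItemStyle">
 <td>Annex I</td><td>CERD</td><td>&nbsp;</td>
</tr>
HTML

# Split into rows first, so that no cell match can run past a row boundary.
sub cells_by_row {
    my ($text) = @_;
    my @rows;
    for my $row ($text =~ m{<tr\b[^>]*>(.*?)</tr>}gs) {
        my @cells = $row =~ m{<td\b[^>]*>(.*?)</td>}gs;
        s{</?nobr>|&nbsp;}{}g for @cells;   # strip <nobr> tags and &nbsp;
        push @rows, \@cells;
    }
    return @rows;
}

my @rows = cells_by_row($html);
print join(' | ', @$_), "\n" for @rows;
```

Each element of `@rows` is an array reference holding one row's cells, so a missing date shows up as an empty string in its own row instead of shifting the cells that follow.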

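I don't mean to judge the poster's motives or constraints. But I can't help comparing the guesswork and careful reading of the source above with the following.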

use HTML::TableExtract; 
my $te = HTML::TableExtract->new(keep_html => 1); 
$te->parse("<table> " . $msg . "</table>"); 
# We have one table, use top-level 'rows()' shorthand method 
foreach my $row ($te->rows) { 
    print join(',', @$row), "\n"; 
} 

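This reports the same 280 cells (once a count is added) and prints the same rows as the procedure above. I only had to glance at the source to see that it lacks the <table> tags. HTML::TableExtract is a subclass of HTML::Parser.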

0

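Your regex requires the sixth column to contain <nobr>...</nobr> tags, which only happens in the first two rows. After that it hangs, because non-greedy quantifiers can only do so much. When no match is possible, they are just as prone to catastrophic backtracking as the greedy kind.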
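A small sketch of what goes wrong, using made-up data: a lazy `.*?` is free to expand across `</td>` and `</tr>` whenever that is the only way to reach a later literal such as `<nobr>`. Below, the first row has no `<nobr>`, so the lazy pattern quietly borrows the date from the second row, while a pattern built on `[^<>]*` cannot leave the cell. On a long input with no `<nobr>` left at all, that same freedom is what lets the engine try an enormous number of combinations before it gives up.

```perl
use strict;
use warnings;

# Hypothetical sample: row 1 has no <nobr>, row 2 does.
my $html = '<tr class="rgRow"><td>Annex I</td><td>&nbsp;</td></tr>'
         . '<tr class="rgRow"><td>Report A</td><td><nobr>21 Feb 1974</nobr></td></tr>';

# Lazy version: .*? may cross tag and row boundaries to reach <nobr>.
my ($lazy_title, $lazy_date) =
    $html =~ m{rgRow.*?<td>(.*?)</td><td.*?><nobr>(.*?)</nobr></td>}s;

# Specific version: [^<>]* cannot cross a tag, so row 1 is simply skipped.
my ($strict_title, $strict_date) =
    $html =~ m{rgRow[^<>]*><td>([^<>]*)</td><td[^<>]*><nobr>([^<>]*)</nobr></td>};

print "lazy:   $lazy_title / $lazy_date\n";     # lazy:   Annex I / 21 Feb 1974
print "strict: $strict_title / $strict_date\n"; # strict: Report A / 21 Feb 1974
```

The lazy match pairs the title of one row with the date of another, which is wrong data even when the match returns quickly.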

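Instead of relying on .*? all the time, try to be specific about what you want to match. In this case it's simple: the TDs you are matching don't contain other tags, so you can use [^<>]* to capture their contents. In fact, you should use it everywhere you are currently using .*?

In the regex below, I have also made the NOBR tags optional, and I have extended it to match the whole opening TR tag. The /x modifier is there for readability's sake.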

while ($content =~ 
    m!<tr\s+class="rgRow[^<>]*>\s* 
    <td[^<>]*>([^<>]*)</td> 
    <td[^<>]*>([^<>]*)</td> 
    <td[^<>]*>([^<>]*)</td> 
    <td[^<>]*>[^<>]*</td> 
    <td[^<>]*>([^<>]*)</td> 
    <td[^<>]*>(?:<nobr>)?([^<>]*)(?:</nobr>)?</td> 
    !sxg) { 
    # do stuff 
}
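Following the `qr` suggestion from the comments, the same pattern can also be assembled from small named pieces. This is only a sketch: the sample rows are shortened (the hidden cells are left out), and the piece names and hash keys (`title`, `type`, `committee`, `symbol`, `date`) are my own guesses at what the columns mean, not something taken from the original code.

```perl
use strict;
use warnings;

# Hypothetical sample rows in the shape of the question's HTML.
my $content = <<'HTML';
<tr class="rgRow InnerItemStyle" id="r0">
 <td>Report A</td><td>State party's report</td><td>CERD</td><td>Morocco</td><td>CERD/C/R.65/Add.1</td><td><nobr>21 Feb 1974</nobr></td>
</tr><tr class="rgRow InnerAlernatingItemStyle" id="r1">
 <td>Annex I</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td>
</tr>
HTML

my $text  = qr{[^<>]*};                    # cell text: anything but a tag
my $cell  = qr{<td[^<>]*>($text)</td>};    # a cell we capture
my $skip  = qr{<td[^<>]*>$text</td>};      # a cell we do not need
my $dated = qr{<td[^<>]*>(?:<nobr>)?($text)(?:</nobr>)?</td>};  # <nobr> optional
my $row   = qr{<tr\s+class="rgRow[^<>]*>\s*$cell$cell$cell$skip$cell$dated};

my @records;
while ($content =~ /$row/g) {
    push @records, {
        title => $1, type => $2, committee => $3, symbol => $4, date => $5,
    };
}
printf "%s | %s | %s\n", @{$_}{qw(title symbol date)} for @records;
```

Because every piece is anchored to tag boundaries, a row that lacks `<nobr>` still matches on its own cells, and a row that does not fit fails quickly instead of backtracking across the rest of the page.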