preg_replace returns null when the input is HTML (but not all the time)

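I am reading HTML from several different sources and have to manipulate it. As part of this, I have a number of preg_replace() calls that replace some of the information in the HTML I receive.

On 90% of the sites I have to do this on, everything works fine. On the remaining 10%, every preg_replace() call returns NULL.

I tried increasing pcre.backtrack_limit and pcre.recursion_limit, based on other articles I found that seemed to describe the same problem, but that did not help.

I printed preg_last_error(), which returns 4. The PHP documentation did not prove very helpful, so if anyone can shed some light on this it might point me in the right direction, but I'm stumped.

One of the offending examples is: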

$html = preg_replace('@<script[^>]*?.*?</script>@siu', '', $html);

But as I said, this works 90% of the time.


2011-01-28 Simon

I don't know how `` is interpreted, but it seems redundant (equivalent to ``, isn't it?). – pascal 2011-01-28 16:22:27

Also, why not include the initial ``? – pascal 2011-01-28 16:23:02

Don't use regular expressions to parse HTML. Use a real DOM parser:

$dom = new DOMDocument; 
$dom->loadHTML($html); 
$scripts = $dom->getElementsByTagName('script'); 
while ($el = $scripts->item(0)) { 
    $el->parentNode->removeChild($el); 
} 
$html = $dom->saveHTML();
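
Real-world pages are often malformed, and loadHTML() reports each problem it finds as a warning. Below is a minimal sketch of the same approach wrapped in a function, assuming you want those warnings collected quietly instead of printed. The function name removeScripts is made up for illustration.

```php
<?php
// Hypothetical helper: strips all <script> elements using DOMDocument.
function removeScripts(string $html): string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so keep removing the
    // first item until none are left.
    $scripts = $dom->getElementsByTagName('script');
    while ($el = $scripts->item(0)) {
        $el->parentNode->removeChild($el);
    }

    return $dom->saveHTML();
}
```

Note that saveHTML() serializes a complete document, so a fragment passed in comes back wrapped in a doctype plus html and body elements.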


2011-01-28 16:18:41 lonesomeday

You have bad UTF-8.

/** 
* Returned by preg_last_error if the last error was 
* caused by malformed UTF-8 data (only when running a regex in UTF-8 mode). Available 
* since PHP 5.2.0. 
* @link http://php.net/manual/en/pcre.constants.php 
*/ 
define ('PREG_BAD_UTF8_ERROR', 4);
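
To see the error in isolation, here is a small sketch that feeds the pattern from the question a subject containing a byte that is not valid UTF-8. The sample string is invented for illustration; any page served in a single-byte encoding such as ISO-8859-1 could plausibly contain bytes like this.

```php
<?php
// "\xE9" is "é" in ISO-8859-1, but on its own it is not valid UTF-8.
$html = "<p>caf\xE9</p><script>var x = 1;</script>";

// With the /u modifier, PCRE validates the subject as UTF-8 before matching.
$result = preg_replace('@<script[^>]*?.*?</script>@siu', '', $html);

var_dump($result === null);                           // the call failed
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // error code 4
```

This would also explain the 90/10 split: pages that are valid UTF-8 (which includes plain ASCII) pass the check, while pages in other encodings fail as soon as they contain a non-ASCII byte sequence that is invalid in UTF-8.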

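However, you really shouldn't use regular expressions to parse HTML. Use DOMDocument.

Edit: Also, I don't think this answer would be complete without including You can't parse [X]HTML with regex.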


2011-01-28 16:19:27

Your error #4 is PREG_BAD_UTF8_ERROR. You should check the character set used by the sites that cause this error.
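
One way to act on this, sketched under assumptions: validate the input as UTF-8 and convert it when the check fails. The helper name toUtf8 and the ISO-8859-1 fallback are illustrative guesses, not something the question states; ideally the source charset would come from the site's Content-Type header or its meta tag.

```php
<?php
// Hypothetical helper: returns $html as valid UTF-8.
// $fallbackCharset is an assumption; prefer the charset the site declares.
function toUtf8(string $html, string $fallbackCharset = 'ISO-8859-1'): string
{
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($html, 'UTF-8', $fallbackCharset);
}

// Sample input with an ISO-8859-1 "é" byte, invented for illustration.
$html = toUtf8("<p>caf\xE9</p><script>var x = 1;</script>");
$html = preg_replace('@<script[^>]*?.*?</script>@siu', '', $html);
```

If the pattern does not rely on Unicode semantics, as appears to be the case for the pattern in the question, dropping the u modifier may be a simpler alternative, since PCRE then skips the UTF-8 validity check.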


2011-01-28 16:21:30 soju

You may have exceeded the backtrack and/or internal recursion limit. See http://php.net/manual/en/pcre.configuration.php

Try this before the preg_replace:

ini_set('pcre.backtrack_limit', '10000000'); 
ini_set('pcre.recursion_limit', '10000000');
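
Raising these limits only helps when preg_last_error() actually reports a limit error, so it may be worth checking which error occurred first. The sketch below maps the numeric code to its constant name. The helper name pregErrorName and the sample input are hypothetical, and the constant list assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: maps a preg_last_error() code to its constant name.
function pregErrorName(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
    ];
    return $names[$code] ?? "unknown PCRE error ($code)";
}

// Sample input, invented for illustration.
$html = '<p>ok</p><script>var x = 1;</script>';

$result = preg_replace('@<script[^>]*?.*?</script>@siu', '', $html);
if ($result === null) {
    // Only the two *_LIMIT_ERROR codes are affected by the ini settings above.
    echo pregErrorName(preg_last_error()), "\n";
}
```

A code of 4, as reported in the question, maps to PREG_BAD_UTF8_ERROR, which the limit settings do not influence.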


2011-01-28 16:22:36 arnaud576875
