Why does a working regular expression fail with PHP's preg_match_all()?

I have the following regular expression in a PHP script:

$total_matches = preg_match_all('{ 

     <a\shref=" 
     (?<link>[^"]+) 
     "(?:(?!src=).)+src=" 
     (?<image>[^"]+) 
     (?:(?!designer-name">).)+designer-name"> 
     (?<brand>[^<]+) 
     (?:(?!title=).)+title=" 
     (?<title>((?!">).)+) 
     (?:(?!"price">).)+"price">\$ 
     (?<price>[\d.,]+) 

}xsi',$output,$all_matches,PREG_SET_ORDER);

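This regex appears to parse the following just fine, either through PHP or using the tester at regexr.com with the same options set (case insensitive, extended, dot matches newlines):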

<a href="http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title= 
    "DORDOGNE 120 PLATEAU SANDALEN" class="product-image"> 
    <img class="image1st" src= "http://mytheresaimages.s3.amazonaws.com/catalog/product/cache/common/product_114114/small_ image/230x260/9df78eab33525d08d6e5fb8d27136e95/P/0/P00027794-DORDOGNE-120-PLATEAU-SANDALEN-STANDARD.jpg" 
    width="230" height="260" 
    alt= "Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" 
    title= "Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" /> 
<img class="image2nd" src= "http://mytheresaimages.s3.amazonaws.com/catalog/product/cache/common/product_114114/image/230x260/9df78eab33525d08d6e5fb8d27136e95/P/0/P00027794-DORDOGNE-120-PLATEAU-SANDALEN-DETAIL_2.jpg" 
width="230" height="260" alt= 
"Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" title= 
"Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" /> <span class= 
"availability"><strong>available sizes</strong><br /></span></a> 

<div style="margin-left: 2em" class="available-sizes"> 
<h2 class="designer-name">Christian Louboutin</h2> 

<div class="product-buttons"> 
    <div class="product-button"> 
    NEW ARRIVAL 
    </div> 

    <div class="clearer"></div> 
</div> 

<h3 class="product-name"><a href= 
"http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title= 
"DORDOGNE 120 SANDALS">DORDOGNE 120 SANDALS</a></h3> 

<div class="price-box"> 
    <span class="regular-price" id="product-price-114114"><span class= 
    "price">$805.00</span></span> 
</div>

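If I try to parse several of these matches in a row, it also works without trouble. But when I try to parse the full web page these matches come from (I have permission to parse this)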

http://www.mytheresa.com/us_en/new-arrivals/what-s-new-this-week-1.html?limit=12

the regex fails (I actually get a 500 error). I have tried increasing the backtrack limit using

ini_set('pcre.backtrack_limit',100000000); 
ini_set('pcre.recursion_limit',100000000);

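but this does not fix the problem. I would like to know what I am doing wrong that makes the regex fail through PHP when it appears to be valid and to match the code on the page in question. Fiddling with it suggests that the negative lookaheads (together with the length of the page) are causing the problem, but I'm not sure how I've messed them up. I am running PHP 5.2.17.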

Asked 2011-08-10 by jela

And does using the content require permission? – 2011-08-10 03:17:09

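Also check the `PCRE_VERSION` constant. If it's reasonably outdated, try installing a newer `libpcre`. The `(?:(?!..).)+` assertions can be expensive. Unless you want to rewrite the regex or break it up with preg_replace_callback, consider using an HTML toolkit like phpQuery or QueryPath for the extraction (easier, and usually not significantly slower). – mario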
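To make that check concrete, here is a minimal diagnostic sketch. It prints `PCRE_VERSION` and then uses `preg_last_error()` to see why a match gave up. The helper `describe_preg_error()`, the tiny limit of 100, and the synthetic subject are illustrative assumptions, not part of the original script. One thing worth keeping in mind: the PHP manual warns that a very high `pcre.recursion_limit` can exhaust the process stack and crash PHP, which may be what surfaces as the 500 error.

```php
<?php
// Hypothetical helper (not from the thread): turn a preg_last_error() code into its constant name.
function describe_preg_error($code)
{
    $names = array(
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
    );
    return isset($names[$code]) ? $names[$code] : 'unknown (' . $code . ')';
}

echo 'PCRE version: ', PCRE_VERSION, "\n";

// Assumption: a deliberately tiny limit, so the demo fails fast and cleanly
// instead of crashing the process.
ini_set('pcre.backtrack_limit', '100');
// Newer PHP builds may use the PCRE JIT, which counts limits differently; switch it
// off for the demo (PHP 5.2 has no such setting, so the call does nothing there).
ini_set('pcre.jit', '0');

// Stand-in for a long page: the tempered dot has to walk the whole string
// and then backtrack because the closing [bc] never appears.
$subject = str_repeat('a', 10000);
$result  = preg_match('/(?:(?!b).)+[bc]/s', $subject);
$error   = describe_preg_error(preg_last_error());

var_dump($result);
echo 'preg_last_error: ', $error, "\n";
```

If the real pattern reports a backtrack or recursion limit error here, the regex itself is the problem rather than the PCRE build.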

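@mario My PCRE_VERSION is 8.02 2010-03-19, and I'm not sure whether that counts as old (it is 4 releases behind). I think I may have to rework this regex. I'm surprised the lookaheads are that expensive, but I think you may be right. If I can't rewrite the regex, I'll look into phpQuery and QueryPath. – jela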

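You've made a classic blunder! Don't use regex to parse HTML! It breaks regex! (It's right up there with "never get involved in a land war in Asia" and "never go in against a Sicilian when death is on the line".)

You should use SimpleXML or DomDocument to parse this: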

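$dom = new DOMDocument();
// loadHTML() expects markup, not a URL; loadHTMLFile() fetches the page
// (needs allow_url_fopen). The @ silences warnings about malformed HTML.
@$dom->loadHTMLFile('http://www.mytheresa.com/us_en/new-arrivals/' .
    'what-s-new-this-week-1.html?limit=12');

$path = new DOMXPath($dom);
// this query is based on the link you provided, not your regex
// (attributes need the @ prefix in XPath)
$nodes = $path->query('//ul[@class="products-grid first odd"]/li');
foreach ($nodes as $node)
{
    // the first anchor inside the <li> is the one you're looking for initially
    $anchor = $path->query('.//a', $node)->item(0);
    if ($anchor !== null) {
        echo $anchor->getAttribute('href'), "\n";
    }
    // query the other descendants the same way
}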

Answered 2011-08-10 03:49:51 by cwallenpoole

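We need a new "Inconceivable" badge! – Phil

Come on, it's *certainly conceivable*. Sometimes it's your only chance, if you have huge legacy FrontPage cruft to put up with. – ZJR

@ZJR You missed the chance to say: "That word, I do not think it means what you think it means." – cwallenpoole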

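Those negative lookaheads are clever, but then... slightly too clever.

I agree that you use too many of them not to get side effects.

I can't see which one is running rampant right now, but putting a repeated . in there like that... is always bound to give you greediness problems.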

This one, for example, is surely unnecessary:

title="
(?<title>((?!">).)+)

You can write it as

title="(?<title>.*?)">

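...and there are more like it. I would change them.

In general, debugging a regex means rewriting it, and rewriting it again and again with different constructs, until you find the right balance between functionality and maintainability.

Another thing: I would use <a\s+ instead of <a\s, just to be slightly more flexible.
Stay slightly flexible, it pays.

Also: title= could show up as title\s*=\s*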
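Putting those suggestions together, the pattern could look roughly like the sketch below: lazy `.*?` in place of the tempered dots, `<a\s+`, and `\s*=\s*` around the attribute equals signs. This is only a sketch. The sample markup is condensed from the question, the image URL is a placeholder, and the lazy gaps are not strictly equivalent to the lookaheads (a lazy gap may skip ahead to a later `src=` or `title=` if the first one does not fit), so it still needs testing against the full page.

```php
<?php
// Condensed sample based on the markup in the question; the image URL is a placeholder.
$output = '<a href="http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title=
    "DORDOGNE 120 PLATEAU SANDALEN" class="product-image">
    <img class="image1st" src= "http://example.invalid/P00027794.jpg" width="230" /></a>
<h2 class="designer-name">Christian Louboutin</h2>
<h3 class="product-name"><a href=
"http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title=
"DORDOGNE 120 SANDALS">DORDOGNE 120 SANDALS</a></h3>
<span class=
    "price">$805.00</span>';

// Same capture groups as the question, with lazy gaps instead of (?:(?!...).)+
$pattern = '{
    <a\s+href\s*=\s*"    (?<link>[^"]+)"
    .*? src\s*=\s*"      (?<image>[^"]+)"
    .*? designer-name">  (?<brand>[^<]+)
    .*? title\s*=\s*"    (?<title>.*?)">
    .*? "price">\$       (?<price>[\d.,]+)
}xsi';

$total_matches = preg_match_all($pattern, $output, $all_matches, PREG_SET_ORDER);

foreach ($all_matches as $match) {
    echo $match['brand'], ' | ', $match['title'], ' | $', $match['price'], "\n";
}
```

A lazy gap can still backtrack a long way when one of the fields is missing from an item, so it is worth checking preg_last_error() after the call on real pages.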

Answered 2011-08-10 04:34:10 by ZJR

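The title is an interesting case, because technically that lookahead is redundant. The problem is that sometimes whoever wrote the HTML fails to encode double quotes inside the title properly, which means I can't trust that a double quote on its own marks the end of the title. In any case, I'll start replacing the negative lookaheads with lazy stars and see what happens. You're definitely right about adding the whitespace. – jela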
