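How can I use PHP preg_match_all to distinguish anchor elements that are identified by an attribute of an inner HTML element?

I have sets of HTML anchor elements, each containing an image element. For each set, using PHP-CLI, I want to pull out the URLs and classify them by type. The type of an anchor can only be determined from an attribute of its child image element. This would be easy if each set contained only one anchor of each type. My problem arises when two anchor elements of one type are separated by one or more anchors of other types. My non-greedy parenthesized sub-pattern seems to become greedy and expands until it finds the second relevant child attribute. In my test script, I am trying to pull just the "Userlink" URLs out from among the other types. Using a simple pattern like: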

#<a href="(.*?)" custattr="value1"><img alt="Userlink"#

and a set like this:

<li><a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink" class="common_link_class" height="123" src="pic0.png" width="123" style="width: 123px;"></a></li><li><a href="http://www.socnet1.com/username1" custattr="value1"><img alt="Socnet1" class="common_link_class" height="123" src="pic1.png" width="123" style="width: 123px;"></a></li><li><a href="http://www.socnet2.com/username1" custattr="value1"><img alt="Socnet2" class="common_link_class" height="123" src="pic2.png" width="123" style="width: 123px;"></a></li><li><a href="mailto:[email protected]" custattr="value1"><img alt="Usermail" class="common_link_class" height="123" src="pic3.png" width="123" style="width: 123px;"></a></li><li><a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink" class="common_link_class" height="123" src="pic4.png" width="123" style="width: 123px;"></a></li>

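(Sorry, but the actual HTML really is all on one line like this.)

My sub-pattern captures everything from the start of the first "Userlink" URL to the end of the last one.

I have tried many look-ahead variations, too many to list here. So far they have either returned no matches at all or given the same result as above.

Here is my test script (run from a Bash shell):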

#!/usr/bin/php
<?php
    $lines = 0;
    $input = "";
    $matches = array();

    // Read all of standard input into a single string.
    while (($line = fgets(STDIN)) !== false) {
        $input .= $line;
        $lines++;
    }
    fwrite(STDERR, "Processing $lines\n");

    // The pattern under test.
    $pcre = '#<a href="(.*?)" custattr="value1"><img alt="Userlink"#';

    if (preg_match_all($pcre, $input, $matches)) {
        // count($matches) counts groups (full match plus captures), not matches.
        fwrite(STDERR, "\$matches has " . count($matches) . " elements\n");
        foreach ($matches[1] as $match) {
            fwrite(STDOUT, $match . "\n");
        }
    }
?>

What PCRE pattern for PHP's preg_match_all() would return the two "Userlink" URLs in the example above?

2014-02-27 TrooPhalce

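[**Don't use regex** to parse HTML](http://stackoverflow.com/a/1732454/2057919). Use a parser.

Instead of the lazy `.*?`, use a greedy negated character class: `[^"]*`.

As Ed Cottrell says with that *^?!# link, if you only want to find the href content, using the DOM can be a good choice.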
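The DOM suggestion in the comments could look roughly like the sketch below. It is only an illustration, not code from the thread: the `$html` sample is a shortened, hypothetical version of the question's markup, and it assumes PHP's `dom` extension is available.

```php
<?php
// Hypothetical, shortened sample modeled on the question's markup.
$html = '<ul>'
    . '<li><a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic0.png"></a></li>'
    . '<li><a href="http://www.socnet1.com/username1" custattr="value1"><img alt="Socnet1" src="pic1.png"></a></li>'
    . '<li><a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic4.png"></a></li>'
    . '</ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings off STDERR
$doc->loadHTML($html);
libxml_clear_errors();

// Select the href attribute of every <a> that has an <img alt="Userlink"> child.
$xpath = new DOMXPath($doc);
$urls = [];
foreach ($xpath->query('//a[img[@alt="Userlink"]]/@href') as $attr) {
    $urls[] = $attr->nodeValue;
}

echo implode("\n", $urls), "\n";
```

Because the XPath predicate tests the child image directly, the anchors of other types in between never enter the result.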

I have taken the liberty of changing your variable names:

$pattern = '~<a href="([^"]++)" custattr="value1"><img alt="Userlink"~';

if ($nb = preg_match_all($pattern, $input, $matches)) {
    fwrite(STDERR, "\$matches has " . $nb . " elements\n");
    // $matches[1] holds the captured URLs (first capture group).
    fwrite(STDOUT, implode("\n", $matches[1]) . "\n");
}

Note that preg_match_all returns the number of full-pattern matches (which may be zero), or false on failure.
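To see why the negated character class helps, the two patterns can be compared side by side. This is a minimal sketch on a shortened, hypothetical sample of the question's markup. The lazy `.*?` is allowed to match a `"`, so when a match attempt starts at a non-Userlink anchor it keeps expanding until it reaches the next `alt="Userlink"`. By contrast, `[^"]++` cannot leave the href value.

```php
<?php
// Hypothetical, shortened sample modeled on the question's markup.
$html = '<li><a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic0.png"></a></li>'
    . '<li><a href="http://www.socnet1.com/username1" custattr="value1"><img alt="Socnet1" src="pic1.png"></a></li>'
    . '<li><a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic4.png"></a></li>';

// Lazy dot: the second match starts at the Socnet1 anchor and runs on
// until the next alt="Userlink", so the capture spans two anchors.
preg_match_all('#<a href="(.*?)" custattr="value1"><img alt="Userlink"#', $html, $lazy);

// Negated class: the capture can never cross the closing quote of href.
preg_match_all('~<a href="([^"]++)" custattr="value1"><img alt="Userlink"~', $html, $strict);

echo "Lazy captures:\n";
print_r($lazy[1]);
echo "Negated-class captures:\n";
print_r($strict[1]);
```

Both calls find two matches on this sample, but only the second returns two clean URLs. The lazy version's second capture begins with the Socnet1 URL and ends with the userlink2 URL.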

2014-02-27 20:31:46

This regex should work:

<a href="([^"]*?)"[^>]*\><img alt="Userlink"

You can see how it works here.

Testing it:

$pcre = '/<a href="([^"]*?)"[^>]*\><img alt="Userlink"/'; 
if (preg_match_all($pcre,$input,$matches)){ 
    var_dump($matches); 
    //$matches[1] will be the array containing the urls. 
} 
/* 
    OUTPUT- 
    array 
     0 => 
     array 
      0 => string '<a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink"' (length=85) 
      1 => string '<a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink"' (length=85) 
     1 => 
     array 
      0 => string 'http://www.userlink1.com/my/page.html' (length=37) 
      1 => string 'http://www.userlink2.com/my/page.html' (length=37) 
*/
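Since the question's stated goal is to classify every URL by type, one possible extension is to capture the `alt` value as a second group and bucket the URLs by it. The sketch below is an assumption-laden illustration rather than part of the answer above: the `$html` sample is shortened and hypothetical, `$byType` is an invented name, and the pattern assumes `alt` is always the first attribute of the `<img>`, as it is in the question's markup.

```php
<?php
// Hypothetical, shortened sample modeled on the question's markup.
$html = '<li><a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic0.png"></a></li>'
    . '<li><a href="http://www.socnet1.com/username1" custattr="value1"><img alt="Socnet1" src="pic1.png"></a></li>'
    . '<li><a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic4.png"></a></li>';

// Group 1: href value. Group 2: alt value of the child image.
$pattern = '~<a href="([^"]++)"[^>]*+><img alt="([^"]++)"~';

$byType = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $byType[$m[2]][] = $m[1];   // bucket each URL under its type
    }
}

print_r($byType);
```

With `PREG_SET_ORDER`, each element of `$matches` is one complete match, which keeps each URL paired with its own type.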

2014-02-27 20:31:02 Kamehameha
