Extract the source URL and anchor text from a string

I want to extract data from a series of strings, but I'm having no luck. In the sample code below I tried preg_split, but it does not give me the result I want.

Using the code below:

<?php 
$str = '<a href="http://rads.stackoverflow.com/amzn/click/B008EYEYBA">Nike Air Jordan SC-2 Mens Basketball Shoes 454050-035</a><img src="http://www.assoc-amazon.com/e/ir?t=mytwitterpage-20&l=as2&o=1&a=B008EYEYBA" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /> 
'; 
$chars = preg_split('/ /', $str, -1, PREG_SPLIT_OFFSET_CAPTURE); 

echo '<pre>'; 
print_r($chars); 
echo '</pre>'; 
?>

It gives this result:

Array 
(
    [0] => Array 
     (
      [0] => <a 
      [1] => 0 
     ) 

    [1] => Array 
     (
      [0] => href="http://rads.stackoverflow.com/amzn/click/B008EYEYBA">Nike 
      [1] => 3 
     ) 

    [2] => Array 
     (
      [0] => Air 
      [1] => 167 
     ) 

    [3] => Array 
     (
      [0] => Jordan 
      [1] => 171 
     ) 

    [4] => Array 
     (
      [0] => SC-2 
      [1] => 178 
     ) 

    [5] => Array 
     (
      [0] => Mens 
      [1] => 183 
     ) 

    [6] => Array 
     (
      [0] => Basketball 
      [1] => 188 
     ) 

    [7] => Array 
     (
      [0] => Shoes 
      [1] => 199 
     ) 

    [8] => Array 
     (
      [0] => 454050-035</a><img 
      [1] => 205 
     ) 

    [9] => Array 
     (
      [0] => src="http://www.assoc-amazon.com/e/ir?t=mytwitterpage-20&l=as2&o=1&a=B008EYEYBA" 
      [1] => 224 
     ) 

    [10] => Array 
     (
      [0] => width="1" 
      [1] => 305 
     ) 

    [11] => Array 
     (
      [0] => height="1" 
      [1] => 315 
     ) 

    [12] => Array 
     (
      [0] => border="0" 
      [1] => 326 
     ) 

    [13] => Array 
     (
      [0] => alt="" 
      [1] => 337 
     ) 

    [14] => Array 
     (
      [0] => style="border:none 
      [1] => 344 
     ) 

    [15] => Array 
     (
      [0] => !important; 
      [1] => 363 
     ) 

    [16] => Array 
     (
      [0] => margin:0px 
      [1] => 375 
     ) 

    [17] => Array 
     (
      [0] => !important;" 
      [1] => 386 
     ) 

    [18] => Array 
     (
      [0] => /> 

      [1] => 399 
     ) 

)

Notice that the word "Nike" is included here, when all I need is the URL alone:

[1] => Array 
     (
      [0] => href="http://rads.stackoverflow.com/amzn/click/B008EYEYBA">Nike 
      [1] => 3 
     )


In fact, my ultimate goal with $str is simply to output the source URL and the anchor text into separate arrays, like this:

URL:

http://www.amazon.com/gp/product/B008EYEYBA/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B008EYEYBA&linkCode=as2&tag=mytwitterpage-20

Anchor text:

Nike Air Jordan SC-2 Mens Basketball Shoes 454050-035

Any ideas on how I can accomplish this would be greatly appreciated.

Source

2012-11-23 anagnam

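Using regular expressions to parse HTML is bad practice. PHP has a DOM extension. You simply cannot build a general-purpose regular expression that will work for any HTML you might encounter. The DOM approach scales much better.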

$string = '<a href="http://rads.stackoverflow.com/amzn/click/B008EYEYBA">Nike Air Jordan SC-2 Mens Basketball Shoes 454050-035</a><img src="http://www.assoc-amazon.com/e/ir?t=mytwitterpage-20&l=as2&o=1&a=B008EYEYBA" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />'; 
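$dom = new DOMDocument(); 
// Buffer libxml parse warnings instead of printing them.
libxml_use_internal_errors(true); 
$dom->loadHTML($string); 
libxml_clear_errors(); 
// Take the first <a> element, then read its text and its href attribute.
$elementA = $dom->getElementsByTagName('a')->item(0); 
$aText = $elementA->nodeValue; 
$aLink = $elementA->getAttribute('href'); 
echo $aLink . "\n" . $aText;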
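
Since the question asks for the URLs and the anchor texts in separate arrays, the same DOM approach can be looped over every link in the string. This is only a minimal sketch, assuming PHP 7.1+ with the DOM extension enabled; `extractLinks` is a hypothetical helper name, not something from the answer above.

```php
<?php
// Hypothetical helper (not part of the original answer): returns two
// parallel arrays, one with each link's href and one with its text.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    // Buffer libxml warnings (unescaped "&", HTML fragment) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    $texts = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $urls[] = $anchor->getAttribute('href');
        $texts[] = trim($anchor->textContent);
    }

    return [$urls, $texts];
}

$str = '<a href="http://rads.stackoverflow.com/amzn/click/B008EYEYBA">Nike Air Jordan SC-2 Mens Basketball Shoes 454050-035</a><img src="http://www.assoc-amazon.com/e/ir?t=mytwitterpage-20&l=as2&o=1&a=B008EYEYBA" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />';

[$urls, $texts] = extractLinks($str);
print_r($urls);
print_r($texts);
```

With more than one link in the input, index N of `$urls` belongs to index N of `$texts`.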

Source

2012-11-23 13:04:03

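I tried it, and it gives me Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 1 in – anagnam

That is because only a fragment of a whole document is provided. You can silence the warning or supply a complete document. –

Added libxml warning suppression... –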
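
For anyone who would rather look at the warnings than just discard them, here is a rough sketch of the suppression approach, assuming the DOM and libxml extensions are available. With `libxml_use_internal_errors(true)` the parser's messages are buffered and can be read back with `libxml_get_errors()`; which messages show up (if any) depends on the installed libxml2 version, so no particular output is assumed here.

```php
<?php
$str = '<a href="http://rads.stackoverflow.com/amzn/click/B008EYEYBA">Nike Air Jordan SC-2 Mens Basketball Shoes 454050-035</a><img src="http://www.assoc-amazon.com/e/ir?t=mytwitterpage-20&l=as2&o=1&a=B008EYEYBA" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />';

// Buffer parser messages instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($str);

// Each entry is a LibXMLError object describing one buffered message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo 'line ', $error->line, ': ', trim($error->message), "\n";
}

// Empty the buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The document is still usable even though the input was only a fragment.
$src = $dom->getElementsByTagName('img')->item(0)->getAttribute('src');
echo $src, "\n";
```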

You can do this with the help of a PHP function.

What you want here is to remove the anchor tag.

You can use the strip_tags() function to remove all tags.
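
A quick sketch of what that looks like on the string from the question. Note that `strip_tags()` throws the tags away together with their attributes, so it yields the anchor text only; the `href` value has to come from somewhere else.

```php
<?php
$str = '<a href="http://rads.stackoverflow.com/amzn/click/B008EYEYBA">Nike Air Jordan SC-2 Mens Basketball Shoes 454050-035</a><img src="http://www.assoc-amazon.com/e/ir?t=mytwitterpage-20&l=as2&o=1&a=B008EYEYBA" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /> 
';

// Remove every tag; trim() drops the trailing space and newline.
$anchorText = trim(strip_tags($str));

echo $anchorText, "\n";
// prints: Nike Air Jordan SC-2 Mens Basketball Shoes 454050-035
```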

Source

2012-11-23 12:36:28 user1972007

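Thanks webduos, it works, but it only gives me the anchor text. What about the source URL? Is there a built-in function that can give me the URL the same way strip_tags() gives me the text, so I can use it alongside the anchor text? – anagnam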
