How do I get a list of links in a web page with PHP?

Possible duplicate:
Parse Website for URLs

How can I get all the links in a web page using PHP?

I need to get a list of links such as:

<a href="http://www.google.com">Google</a>

I want to fetch the href (http://www.google.com) and the text (Google).

The situation is:

I am building a crawler, and I want it to get all the links that exist in a database table.

Source

2011-06-11 Mesaber

There are a couple of ways to do this, but the way I would approach it is something like the following.

Use cURL to fetch the page, i.e.:

// $target_url has the url to be fetched, ie: "http://www.website.com"
// $userAgent should be set to a friendly agent, sneaky but hey...

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// Create the handle first; every curl_setopt() call needs a valid $ch.
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);

// curl_exec() returns false on failure; an empty page is not an error.
if ($html === false) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

If all goes well, the page content is now all in $html.

Let's move on and load the page into a DOM object:

$dom = new DOMDocument(); 
@$dom->loadHTML($html);
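The @ operator above hides the warnings that loadHTML() raises on sloppy markup. As a possible alternative, here is a minimal sketch that asks libxml to collect those messages instead, so they can be inspected or discarded. The sample markup is made up for illustration and is not from the original answer.

```php
<?php
// Hypothetical sample markup, deliberately malformed (stray </span>).
$html = '<html><body><p>Broken <a href="http://www.google.com">Google</a></span></p></body></html>';

// Collect libxml parse messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Grab whatever the parser complained about, then reset libxml's state.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo $loaded ? 'parsed' : 'failed', ' with ', count($errors), " parser message(s)\n";
```

This keeps the rest of the script free of the @ operator, which would otherwise also hide unrelated errors on that line.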

So far so good. XPath to the rescue, to scrape the links out of the DOM object:

$xpath = new DOMXPath($dom); 
$hrefs = $xpath->evaluate("/html/body//a");

Loop through the results and get the links:

$links = array();
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $link = $href->getAttribute('href');
    $text = $href->nodeValue;

    // Do what you want with the link, print it out:
    echo $text , ' -> ' , $link;

    // Or save this in an array for later processing..
    $links[$i]['href'] = $link;
    $links[$i]['text'] = $text;
}

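$hrefs is an object of type DOMNodeList, and item() returns the DOMNode object at the specified index. So basically we have a loop that retrieves each link as a DOMNode object. (For <a> tags each node is a DOMElement, a subclass of DOMNode, which is why getAttribute() is available.)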
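To make the types concrete, here is a small self-contained sketch that runs the same XPath expression against a one-link page and checks what comes back. It also uses foreach, since DOMNodeList can be iterated directly. The inline HTML is an assumption for illustration only.

```php
<?php
// Hypothetical one-link page used only to demonstrate the object types.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><a href="http://www.google.com">Google</a></body></html>');

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate('/html/body//a');

// The XPath node-set comes back as a DOMNodeList.
var_dump($hrefs instanceof DOMNodeList);

// DOMNodeList is traversable, so foreach works as well as item($i).
foreach ($hrefs as $node) {
    // Element nodes are DOMElement instances, which provide getAttribute().
    var_dump($node instanceof DOMElement);
    echo $node->getAttribute('href'), ' => ', $node->nodeValue, "\n";
}
```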

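This should pretty much do it for you. The only part I am not 100% sure about is what happens when a link wraps an image or is just an anchor; I don't know how those cases behave, so you will need to test and filter them out.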
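One possible way to handle those cases is sketched below. It is an assumption about what "filtering" could look like, not part of the original answer: extract_links() is a hypothetical helper that skips <a> tags with no href, skips fragment-only links such as "#top", and falls back to the alt text of an <img> when the link has no text of its own.

```php
<?php
// Hypothetical helper: returns [['href' => ..., 'text' => ...], ...].
function extract_links(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];

    // Only <a> elements that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = trim($a->getAttribute('href'));

        // Skip empty hrefs and in-page fragment links.
        if ($href === '' || $href[0] === '#') {
            continue;
        }

        $text = trim($a->textContent);

        // An image link has no text; use the first <img> alt if there is one.
        if ($text === '') {
            $img = $a->getElementsByTagName('img')->item(0);
            $text = $img !== null ? $img->getAttribute('alt') : '';
        }

        $links[] = ['href' => $href, 'text' => $text];
    }

    return $links;
}

// Made-up sample page covering the awkward cases.
$sample = '<html><body>'
    . '<a href="http://www.google.com">Google</a>'
    . '<a href="#top">Back to top</a>'
    . '<a name="section1">Named anchor</a>'
    . '<a href="http://example.com/"><img src="logo.png" alt="Example logo"></a>'
    . '</body></html>';

print_r(extract_links($sample));
```

With this sample only the Google link and the image link survive; whether fragment links should be dropped or resolved against the page URL depends on what the crawler needs.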

Hope this gives you an idea of how to scrape links. Happy coding.

Source

2011-06-11 09:11:28 stefgosselin
