PHP web crawler does not crawl .php files

This is a simple web crawler I am trying to build:

<?php 

    $to_crawl = "http://samplewebsite.com/about.php"; 

    function get_links($url) 
    { 
     $input = @file_get_contents($url); 
     $regexp = " <a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a> "; 
     preg_match_all("/$regexp/siU", $input, $matches); 

     $l = $matches[2]; 

     foreach ($l as $link) { 
      echo $link."</br>"; 
     } 
    } 


    get_links($to_crawl); 


?>

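When I run the script with the $to_crawl variable set to a URL such as "facebook.com/about", it works, but for some reason, when the link ends with a '.php' file name, it just echoes nothing. Can someone help?

Source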

2015-09-21 Samir Chahine

Can you see the result of that link in a browser? – wmk

Yes, it works fine. I ran it through a web crawler I wrote in Python and it worked perfectly. –

Try '$regexp = "\\s*<a\\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>\\s*"'. Also, have you considered using DOMDocument? You seem to be just collecting the href URLs of '<a>' tags together with their inner text. Right? –
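The pattern in the question begins and ends with a literal space, so it can only match an anchor that has a space directly before `<a` and directly after `</a>`. The comment's suggestion makes that surrounding whitespace optional. The following is a small sketch of the difference; the sample markup and variable names are made up for illustration, and whether this is the actual cause of the empty output on the asker's page is an assumption.

```php
<?php
// Hypothetical sample markup: no whitespace surrounds the <a> element.
$html = '<p><a href="about.php">About</a></p>';

// Pattern as written in the question: literal space before <a and after </a>.
$strict = "/ <a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a> /siU";

// Pattern along the lines of the comment: surrounding whitespace is optional.
$relaxed = "/\\s*<a\\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>\\s*/siU";

// preg_match_all returns the number of full matches found.
$strictCount  = preg_match_all($strict, $html, $strictMatches);
$relaxedCount = preg_match_all($relaxed, $html, $relaxedMatches);

echo "strict pattern matches:  $strictCount\n";
echo "relaxed pattern matches: $relaxedCount\n";

// Capture group 2 holds the href values.
print_r($relaxedMatches[2]);
```

On this sample the strict pattern finds nothing, while the relaxed one captures `about.php`.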

To get all links and their inner text, you can use DOMDocument like this:

$dom = new DOMDocument; 
@$dom->loadHTML($input);     // Your input (HTML code) 

$xp = new DOMXPath($dom); 
$links = $xp->query('//a[@href]');   // XPath to get only <a> tags with a href attribute 

$result = array(); 
foreach ($links as $link) { 
    $result[] = array($link->getAttribute("href"), $link->nodeValue); 
} 
print_r($result);

See the IDEONE demo
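Combined with the fetch step from the question, the DOMDocument approach might be sketched as follows. The function names `fetch_html` and `extract_links` and the sample markup are hypothetical and not part of the original posts. The fetch helper checks the return value of `file_get_contents`, because the `@` operator used in the question hides any warning, and a failed request would then simply print nothing.

```php
<?php
// Hypothetical helper: fetch a page and report a failure instead of hiding it.
function fetch_html(string $url): ?string
{
    $html = @file_get_contents($url);   // returns false on failure
    if ($html === false) {
        $error = error_get_last();
        error_log("Could not fetch $url: " . ($error['message'] ?? 'unknown error'));
        return null;
    }
    return $html;
}

// Hypothetical helper: return [href, text] pairs for every <a> that has an href.
function extract_links(string $html): array
{
    $dom = new DOMDocument;
    @$dom->loadHTML($html);             // @ silences warnings about imperfect markup
    $xp = new DOMXPath($dom);

    $result = array();
    foreach ($xp->query('//a[@href]') as $link) {
        $result[] = array($link->getAttribute('href'), $link->nodeValue);
    }
    return $result;
}

// Demo on inline sample markup, so no network access is needed.
$sample = '<p><a href="about.php">About</a> <a name="top">no href</a> '
        . '<a href="/contact">Contact</a></p>';
print_r(extract_links($sample));

// With a real URL, one might write:
// $html = fetch_html("http://samplewebsite.com/about.php");
// if ($html !== null) { print_r(extract_links($html)); }
```

If `fetch_html` logs an error for the `.php` URL, the problem lies in the request itself rather than in the link extraction.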

Source

2015-09-21 10:54:47
