2017-06-20 76 views

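Finding nested links with Simple HTML DOM (recursion)

I'm new to programming, so here is my problem. I'm trying to build a recursive PHP spider using Simple HTML DOM Parser that crawls a given website and returns a list of pages with their 2xx, 3xx, 4xx and 5xx status codes. I've been looking for a solution for several days but (probably because of my lack of experience) I haven't found anything that works. My current code finds all the links on the root/index page, but I'd like it to also find, recursively, the links inside those pages, and so on, for example down to level 5. Taking the root page as level 0, the recursive part I wrote only prints the level 1 links, repeated 5 times. Any help is appreciated. Thanks.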

<?php 
    echo "<strong><h1>Sitemap</h1></strong><br>"; 

    include_once('simple_html_dom.php'); 

    $url = "http://www.gnet.it/"; 
    $html = new simple_html_dom(); 
    $html->load_file($url); 
    echo "<strong><h2>Int Links</h2></strong><br>"; 
    foreach($html->find("a") as $a) 
    { 
    if((!(preg_match('#^(?:https?|ftp)://.+$#', $a->href)))&&($a->href != null)&&($a->href != "javascript:;")&&($a->href != "#")) 
    { 
    echo "<strong>" . $a->href . "</strong><br>"; 
    } 
    } 

    echo "<strong><h2>Ext Links</h2></strong><br>"; 
    foreach($html->find("a") as $a) 
    { 
    if(((preg_match('#^(?:https?|ftp)://.+$#', $a->href)))&&($a->href != null)&&($a->href != "javascript:;")&&($a->href != "#")) 
    { 
    echo "<strong>" . $a->href . "</strong><br>"; 
    } 
    } 


//recursion 

    $depth = 1; 
    $maxDepth = 5; 
    $recurl = "$a->href"; 
    $rechtml = new simple_html_dom(); 
    $rechtml->load_file($recurl); 
     while($depth <= $maxDepth){ 
     echo "<strong><h2>Link annidati livello $depth</h2></strong><br>"; 
     foreach($rechtml->find("a") as $a) 
     { 
      if(($a->href != null)) 
      { 
      echo "<strong>" . $a->href . "</strong><br>"; 
      } 
     } 
     $depth++; 
     } 


//csv 

    echo "<strong><h1>Google Crawl Errors from CSV</h1></strong><br>"; 
    echo "<table>\n\n"; 
$f = fopen("CrawlErrors.csv", "r"); 
while (($line = fgetcsv($f)) !== false) { 
     echo "<tr>"; 
     foreach ($line as $cell) { 
       echo "<td>" . htmlspecialchars($cell) . "</td>"; 
     } 
     echo "</tr>\n"; 
} 
fclose($f); 
echo "\n</table>"; 
?> 

Answer


Try this:

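I use this basic scraper routine to recursively find all the links across an entire site. You will have to add some logic to stop it from following links on your pages out to external sites, otherwise the crawl will spill onto other sites and can run for a very long time!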

Note that I took most of this code from another SO thread, so the answer is already out there.

function crawl_page($url, $depth = 2){ 

// strip trailing slash from URL 
if(substr($url, -1) == '/') { 
    $url= substr($url, 0, -1); 
} 

// which URLs have we already crawled? 
static $seen = array(); 
if (isset($seen[$url]) || $depth === 0) { 
    return; 
} 
$seen[$url] = true; 

$dom = new DOMDocument('1.0'); 
@$dom->loadHTMLFile($url); 

$anchors = $dom->getElementsByTagName('a'); 
foreach ($anchors as $element) { 
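    $href = $element->getAttribute('href'); 
    // skip empty hrefs, in-page anchors, and non-HTTP schemes such as javascript: or mailto: 
    if ($href === '' || $href[0] === '#' || preg_match('#^(?:javascript|mailto|tel):#i', $href)) { 
     continue; 
    } 
    if (0 !== strpos($href, 'http')) { 
     // build the URLs to the same standard - with http:// etc 
     // (note: every relative href is treated as root-relative here) 
     $path = '/' . ltrim($href, '/'); 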
     if (extension_loaded('http')) { 
      $href = http_build_url($url, array('path' => $path)); 
     } else { 
      $parts = parse_url($url); 
      $href = $parts['scheme'] . '://'; 
      if (isset($parts['user']) && isset($parts['pass'])) { 
       $href .= $parts['user'] . ':' . $parts['pass'] . '@'; 
      } 
      $href .= $parts['host']; 
      if (isset($parts['port'])) { 
       $href .= ':' . $parts['port']; 
      } 
      $href .= $path; 
     } 
    } 
    crawl_page($href, $depth - 1); 
} 

// pull out the actual page name without any parent dirs 
$pos = strrpos($url, '/'); 
$slug = $pos === false ? "root" : substr($url, $pos + 1); 

echo "slug:" . $slug . "<br>"; 
}
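
The routine above leaves two things open: the guard against external sites mentioned in the answer, and the 2xx/3xx/4xx/5xx grouping the question asks for. Below is a minimal sketch of both, assuming PHP 7 or later. The helper names `is_same_host()`, `status_class()` and `fetch_status()` are hypothetical and not part of the original code. The host comparison is deliberately strict (for example, `www.gnet.it` and `gnet.it` count as different hosts), which may or may not be what you want.

```php
<?php
// Hypothetical helpers for illustration; not part of the original answer.

// Returns true when $href is an absolute URL on the same host as $base.
// The comparison is case-insensitive but otherwise exact.
function is_same_host(string $href, string $base): bool
{
    $hrefHost = parse_url($href, PHP_URL_HOST);
    $baseHost = parse_url($base, PHP_URL_HOST);
    if (!is_string($hrefHost) || !is_string($baseHost)) {
        return false; // relative, malformed, or host-less URL
    }
    return strcasecmp($hrefHost, $baseHost) === 0;
}

// Maps an HTTP status code to one of the buckets "1xx" .. "5xx".
function status_class(int $code): string
{
    if ($code < 100 || $code > 599) {
        return 'unknown';
    }
    return intdiv($code, 100) . 'xx';
}

// Requests $url and returns the status code of the first response
// (before any redirect is followed), or 0 if the request fails.
function fetch_status(string $url): int
{
    $headers = @get_headers($url);
    if ($headers === false || !preg_match('#^HTTP/\S+\s+(\d{3})#', $headers[0], $m)) {
        return 0;
    }
    return (int) $m[1];
}
```

One way to use these inside `crawl_page()` would be to wrap the recursive call in `if (is_same_host($href, $url)) { ... }`, and to print `status_class(fetch_status($url))` next to each slug. Note that `fetch_status()` issues an extra request per page, so a real crawler would probably read the status from the same request that fetches the HTML.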