2012-12-16 175 views

I want to write a minimal recursive web crawler in Perl. Its purpose is to discover new URLs starting from a seed and then crawl those new URLs further. The code so far:

use strict; 
use warnings; 
use Carp; 
use Data::Dumper; 
use WWW::Mechanize; 

my $url = "http://foobar.com"; # example 
my %links; 

my $mech = WWW::Mechanize->new(autocheck => 1); 
$mech->get($url); 
my @cr_fronteir = $mech->find_all_links(); 

foreach my $links (@cr_fronteir) { 
    if ($links->[0] =~ m/^http/xms) { 
     $links{$links->[0]} = $links->[1]; 
    } 
} 

This is where I am stuck: how do I go on to crawl the links collected in %links, and how do I add a depth limit so the recursion doesn't overflow the stack? Suggestions appreciated.

Answers

4

You can't have recursion without making it a function.

use strict; 
use warnings; 
use Carp; #unused, but I guess yours was a sample 
use Data::Dumper; 
use WWW::Mechanize; 

my %links; 
my $mech = WWW::Mechanize->new(autocheck => 1); 

sub crawl { 
    my $url = shift; 
    my $depth = shift // 0; # NB: "shift or 0" never assigns the 0, due to precedence 
    #this seems like a good place to assign some form of callback, so you can 
    # generalize this function 

    return if $depth > 10; #change as needed 

    $mech->get($url); 
    my @cr_fronteir = $mech->find_all_links(); 

    #not so sure what you're trying to do; before, $links in the 
    # foreach overrides the global %links 
    #perhaps you meant this...? 
    foreach my $link (@cr_fronteir) { 
     if ($link->[0] =~ m/^http/xms) { 
      $links{$link->[0]} = $link->[1]; 

      #be nice to servers - try not to overload them 
      sleep 3; 
      #recursion! 
      crawl($link->[0], $depth + 1); 
     } 
    } 
} 

crawl("http://foobar.com", 0); 

I don't have Perl installed on this partition, so this is prone to syntax errors and other mischief, but it can serve as a basis.

As the comment in the function says: instead of hard-coding the mapping behaviour, you can generalize the function by passing it a callback and calling that callback for every link you crawl.
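A minimal sketch of that generalization (the function name `crawl_with`, the fetcher abstraction, and the toy in-memory "site" are my own inventions for illustration, not part of the answer; a real crawler would plug WWW::Mechanize in as the fetcher):

```perl
use strict;
use warnings;

# Generalized crawler: the caller supplies a $fetch sub that returns
# [[url, text], ...] for a page, and a $callback invoked per link.
sub crawl_with {
    my ($url, $depth, $fetch, $callback, $seen) = @_;
    $seen //= {};

    return if $depth > 10;       # depth limit, change as needed
    return if $seen->{$url}++;   # visit every URL only once

    for my $link (@{ $fetch->($url) }) {
        my ($href, $text) = @$link;
        next unless $href =~ m{^https?://}xms;
        $callback->($href, $text);
        crawl_with($href, $depth + 1, $fetch, $callback, $seen);
    }
}

# Toy in-memory "site" so the sketch runs without network access
my %site = (
    'http://example.com/'  => [ ['http://example.com/a', 'A'] ],
    'http://example.com/a' => [ ['http://example.com/',  'home'] ],
);

my @visited;
crawl_with('http://example.com/', 0,
    sub { $site{ $_[0] } // [] },
    sub { push @visited, $_[0] });
print "@visited\n";   # http://example.com/a http://example.com/
```

The %seen hash doubles as the cycle guard, which is why the second visit to the seed page stops immediately.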

0

Some code:

while (scalar @links) { 
    my $link = shift @links; 
    process_link($link); 
} 

sub process_link { 
    my $link = shift; 

    $mech->get($link); 
    foreach my $page_link ($mech->find_all_links()) { 
     my $url = $page_link->url_abs; # use the absolute URL string as the key, not the Link object 
     next if $links{$url}; 
     $links{$url} = 1; 
     push @links, $url; 
    } 
} 

P.S. The /m and /s modifiers are unnecessary in your code (and so is /x).

+0

On the /m, /s and /x flags: various Perl style guides recommend adding these to every regex. /ms changes some newbie-unfriendly regex behaviour, and /x is simply very useful ;-) I also always annotate my regexes with all three flags, whether directly needed or not. –
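For illustration (a toy example of my own, not from the comment): /m makes ^ and $ match at line boundaries, /s lets . also match newlines, and /x permits whitespace and comments inside the pattern:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# /m: ^ matches at the start of every line, not just the start of the string
my @starts = $text =~ m/^(\w+)/xmg;   # ('first', 'second')

# /s: . also matches "\n", so one .+ can span both lines
my ($all) = $text =~ m/(first.+line)/xms;   # captures the whole string

# /x: whitespace and comments inside the pattern are ignored
my $ok = "http://example.com" =~ m{
    ^ https? ://    # scheme, then separator
}xms;

print "@starts\n";
print $all eq $text ? "spans the newline\n" : "stops at the newline\n";
print $ok ? "matches\n" : "no match\n";
```

Without /s, the second match would stop at the first "line"; without /m, only 'first' would be captured.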

5

The Mojolicious web framework offers some interesting features useful for web crawlers:

  • No dependencies besides Perl v5.10 or later
  • URL parser
  • DOM tree parser
  • Asynchronous HTTP/HTTPS client (allows concurrent requests without fork() overhead)

Below is an example that recursively crawls the local Apache documentation and displays page titles and extracted links. It uses 4 parallel connections, does not go deeper than 3 path levels, and visits every extracted link only once:

#!/usr/bin/env perl 
use 5.010; 
use open qw(:locale); 
use strict; 
use utf8; 
use warnings qw(all); 

use Mojo::UserAgent; 

# FIFO queue 
my @urls = (Mojo::URL->new('http://localhost/manual/')); 

# User agent following up to 5 redirects 
my $ua = Mojo::UserAgent->new(max_redirects => 5); 

# Track accessed URLs 
my %uniq; 

my $active = 0; 

sub parse { 
    my ($tx) = @_; 

    # Request URL 
    my $url = $tx->req->url; 

    say "\n$url"; 
    say $tx->res->dom->at('html title')->text; 

    # Extract and enqueue URLs 
    for my $e ($tx->res->dom('a[href]')->each) { 

     # Validate href attribute 
     my $link = Mojo::URL->new($e->{href}); 
     next if 'Mojo::URL' ne ref $link; 

     # "normalize" link 
     $link = $link->to_abs($tx->req->url)->fragment(undef); 
     next unless $link->protocol =~ /^https?$/x; 

     # Don't go deeper than /a/b/c 
     next if @{$link->path->parts} > 3; 

     # Access every link only once 
     next if ++$uniq{$link->to_string} > 1; 

     # Don't visit other hosts 
     next if $link->host ne $url->host; 

     push @urls, $link; 
     say " -> $link"; 
    } 

    return; 
} 

sub get_callback { 
    my (undef, $tx) = @_; 

    # Parse only OK HTML responses 
    $tx->res->code == 200 
     and 
    $tx->res->headers->content_type =~ m{^text/html\b}ix 
     and 
    parse($tx); 

    # Deactivate 
    --$active; 

    return; 
} 

Mojo::IOLoop->recurring(
    0 => sub { 

     # Keep up to 4 parallel crawlers sharing the same user agent 
     for ($active .. 4 - 1) { 

      # Dequeue or halt if there are no active crawlers anymore 
      return ($active or Mojo::IOLoop->stop) 
       unless my $url = shift @urls; 

      # Fetch non-blocking just by adding 
      # a callback and marking as active 
      ++$active; 
      $ua->get($url => \&get_callback); 
     } 
    } 
); 

# Start event loop if necessary 
Mojo::IOLoop->start unless Mojo::IOLoop->is_running; 

For more web scraping tips & tricks, read the article I Don’t Need No Stinking API: Web Scraping For Fun and Profit.