Web crawler using Perl

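I want to develop a web crawler that starts from a seed URL and then crawls 100 HTML pages it finds that belong to the same domain as the seed URL, keeping a record of the URLs it has traversed to avoid duplicates. I wrote the following, but the $url_count value does not seem to increase, and the retrieved URLs include links even from other domains. How can I fix this? Here I have used stackoverflow.com as my starting URL.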

use strict; 
use warnings; 

use LWP::Simple; 
use LWP::UserAgent; 
use HTTP::Request; 
use HTTP::Response; 


##open file to store links 
open my $file1,">>", ("extracted_links.txt"); 
select($file1); 

##starting URL 
my @urls = 'http://stackoverflow.com/'; 

my $browser = LWP::UserAgent->new('IE 6'); 
$browser->timeout(10); 
my %visited; 
my $url_count = 0; 


while (@urls) 
{ 
    my $url = shift @urls; 
    if (exists $visited{$url}) ##check if URL already exists 
    { 
     next; 
    } 
    else 
    { 
     $url_count++; 
    }   

    my $request = HTTP::Request->new(GET => $url); 
    my $response = $browser->request($request); 

    if ($response->is_error()) 
    { 
     printf "%s\n", $response->status_line; 
    } 
    else 
    { 
     my $contents = $response->content(); 
     $visited{$url} = 1; 
     @lines = split(/\n/,$contents); 
     foreach $line(@lines) 
     { 
      $line =~ m@(((http\:\/\/)|(www\.))([a-z]|[A-Z]|[0-9]|[/.]|[~]|[-_]|[()])*[^'">])@g; 
      print "$1\n"; 
      push @urls, $$line[2]; 
     } 

     sleep 60; 

     if ($visited{$url} == 100) 
     { 
      last; 
     } 
    } 
} 

close $file1;

Source

2013-03-29 user2154731

See this link for how to get the root domain of a link and compare it with the root domain of your initial URL: http://stackoverflow.com/questions/15627892/perl-regex-grab-everyting-until/15628401#15628401 – imran
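A minimal sketch of the comparison suggested above, assuming the URI module is available (it is normally installed together with LWP). The helper name same_host is hypothetical, and comparing full host names rather than registered root domains is a simplifying assumption: www.stackoverflow.com and stackoverflow.com would count as different hosts here.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: returns 1 when $link is an absolute http(s) URL
# whose host matches the host of $seed, and 0 otherwise.
sub same_host {
    my ($link, $seed) = @_;
    my $u = URI->new($link);
    my $s = URI->new($seed);

    # Relative links have no scheme; mailto:, javascript: etc. have no host.
    return 0 unless defined $u->scheme && $u->scheme =~ /^https?$/;

    my $link_host = $u->host;
    my $seed_host = $s->host;
    return 0 unless defined $link_host && defined $seed_host;

    # Host names are case-insensitive, so compare them lowercased.
    return lc($link_host) eq lc($seed_host) ? 1 : 0;
}
```

Relative links have to be resolved against the page URL before they are passed to a check like this, otherwise they are rejected.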

Since you are going to be extracting URLs and links, start using WWW::Mechanize; it will handle most of the drudgery for you.

I can't use it, because I have to run the code on a server that doesn't have that package, and I don't have permission to install it. – user2154731
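If installing modules is not an option, one possibility is HTML::LinkExtor, which belongs to the HTML-Parser distribution that LWP itself depends on, so it should normally already be present wherever LWP is. The sketch below is only an illustration: extract_links is a hypothetical helper name, and it collects only the href of a tags.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: returns the absolute URLs of all <a href="..."> links
# found in $html. Relative hrefs are resolved against $base.
sub extract_links {
    my ($html, $base) = @_;
    my @links;

    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            # LinkExtor also reports img, link, script and so on; keep only anchors.
            return unless $tag eq 'a' && defined $attr{href};
            # With a base URL given, the value is a URI object; stringify it.
            push @links, "$attr{href}";
        },
        $base,
    );

    $parser->parse($html);
    $parser->eof;
    return @links;
}
```

In a crawler, $html would be the content of a successful response and $base the URL that was fetched.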

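A few points: your URL parsing is fragile, and you certainly won't pick up relative links. Also, you don't test for 100 links; the check $visited{$url} == 100 compares the entry for the current URL, which is only ever set to 1, against 100, and that is almost certainly not what you meant. Finally, I'm not very familiar with LWP, so I'll show an example using the Mojolicious tool suite.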

This seems to work; perhaps it will give you some ideas.

#!/usr/bin/env perl 

use strict; 
use warnings; 

use Mojo::UserAgent; 
use Mojo::URL; 

##open file to store links 
open my $log, '>', 'extracted_links.txt' or die $!; 

##starting URL 
my $base = Mojo::URL->new('http://stackoverflow.com/'); 
my @urls = $base; 

my $ua = Mojo::UserAgent->new; 
my %visited; 
my $url_count = 0; 

while (@urls) { 
    my $url = shift @urls; 
    next if exists $visited{$url}; 

    print "$url\n"; 
    print $log "$url\n"; 

    $visited{$url} = 1; 
    $url_count++;   

    # find all <a> tags and act on each 
    $ua->get($url)->res->dom('a')->each(sub{ 
        my $url = Mojo::URL->new($_->{href}); 
        if ($url->is_abs) { 
            return unless $url->host eq $base->host; 
        } 
        push @urls, $url; 
    }); 

    last if $url_count == 100; 

    sleep 1; 
}
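The counting logic can be separated from the HTTP layer, which makes it easier to check that the loop really stops after a fixed number of distinct pages. The following is a sketch under assumptions, not part of the answer above: crawl is a hypothetical helper, and $get_links is a hypothetical code ref standing in for "fetch the page and return its links".

```perl
use strict;
use warnings;

# Hypothetical helper: breadth-first traversal that stops after $limit
# distinct URLs. $get_links is a code ref returning the links found on a URL.
sub crawl {
    my ($seed, $limit, $get_links) = @_;
    my @queue = ($seed);
    my %visited;
    my @order;    # URLs in the order they were processed

    while (@queue && @order < $limit) {
        my $url = shift @queue;
        next if $visited{$url}++;    # skip URLs that were already processed
        push @order, $url;

        # Queue only links that have not been processed yet.
        push @queue, grep { !$visited{$_} } $get_links->($url);
    }
    return @order;
}
```

The limit is checked against the number of pages processed, not against a value stored in %visited, which only records whether a URL has been seen.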

Source

2013-03-29 03:35:43

Thanks for your reply. But since the Mojolicious toolkit is not available, I can't try out your code. – user2154731

It's easy to install. The one-liner is: 'curl get.mojolicio.us | sh'

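Hi Joel, thanks for the code snippet. But I think it needs a tweak to resolve relative links, otherwise those pages won't work. To fix it, I created a variable named $baseURL to hold the starting URL (in your example 'http://stackoverflow.com'), and then changed your code as follows: 'if ($url->is_abs) { return unless $url->host eq $base->host; } else { $url = Mojo::URL->new($baseURL)->path($_); } push @urls, $url;'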
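As an alternative way to handle relative links, Mojo::URL provides a to_abs method that resolves a URL against a base URL. A small sketch, assuming Mojolicious is installed; resolve is a hypothetical helper name, and the page URL passed in would be the URL of the page currently being parsed.

```perl
use strict;
use warnings;
use Mojo::URL;

# Hypothetical helper: resolve $href against the URL of the page it was
# found on and return the absolute URL as a string.
sub resolve {
    my ($href, $page) = @_;
    # to_abs leaves absolute URLs alone and completes relative ones
    # from the given base URL.
    return Mojo::URL->new($href)->to_abs(Mojo::URL->new($page))->to_string;
}

print resolve('/tags/perl', 'http://stackoverflow.com/questions/15627892'), "\n";
```

Resolving against the current page rather than the site root matters for hrefs without a leading slash, which are relative to the directory of the page they appear on.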
