2010-09-21 92 views
0

我不能似乎得到這個代碼peice的工作:爲什麼我無法使用Perl的LWP :: Simple獲取www.google.com?

$self->{_current_page} = $href; 
    my $response = $ua->get($href); 
    my $responseCode = $response->code; 
    if($responseCode ne "404") { 
     my $content = LWP::Simple->get($href); 
     die "get failed: " . $href if (!defined $content); 
    } 

將返回錯誤:get failed: http://www.google.com

完整的代碼如下:

#!/usr/bin/perl 
use strict; 
use URI; 
use URI::http; 
use File::Basename; 
use DBI; 
use LWP::Simple; 
require LWP::UserAgent; 
my $ua = LWP::UserAgent->new; 
$ua->timeout(10); 
$ua->env_proxy; 
$ua->max_redirect(0); 


package Crawler; 
sub new { 
    my $class = shift; 
    my $self = { 
     _url => shift, 
     _max_link => 0, 
     _local => 1 
    }; 
    bless $self, $class; 
    return $self; 

} 
sub trim{ 
    my($self, $string) = @_; 
    $string =~ s/^\s+//; 
    $string =~ s/\s+$//; 
    return $string; 
} 
sub process_image { 
    my ($self, $process_image) = @_; 
    $self->{_process_image} = $process_image; 
} 
sub local { 
    my ($self, $local) = @_; 
    $self->{_local} = $local; 
} 
sub max_link { 
    my ($self, $max_link) = @_; 
    $self->{_max_link} = $max_link; 
} 
sub x_more { 
    my ($self, $x_more) = @_; 
    $self->{_x_more} = $x_more; 
} 
sub resolve_href { 
    my ($base, $href) = @_; 
    my $uri = URI->new($href); 
    return $uri->rel($base);  
} 
sub write { 
    my ($self, $ref, $data) = @_; 
    open FILE, '>c:/perlscripts/' . $ref . '_' . $self->{_process_image} . '.txt'; 
    foreach($data) { 
     print FILE $self->trim($_) . "\n"; 
    } 
    close(FILE); 
} 
sub scrape { 
    my @m_error_array; 
    my @m_href_array; 
    my @href_array; 
    my ($self, $DBhost, $DBuser, $DBpass, $DBname) = @_; 
    my ($dbh, $query, $result, $array); 
    my $DNS = "dbi:mysql:$DBname:$DBhost:3306"; 
    $dbh = DBI->connect($DNS, $DBuser, $DBpass) or die $DBI::errstr; 
    if(defined($self->{_process_image}) && (-e 'c:/perlscripts/href_w_' . $self->{_process_image} . ".txt")) { 
     open ERROR_W, "<c:/perlscripts/error_w_" . $self->{_process_image} . ".txt"; 
     open M_HREF_W, "<c:/perlscripts/m_href_w_" . $self->{_process_image} . ".txt"; 
     open HREF_W, "<c:/perlscripts/href_w_" . $self->{_process_image} . ".txt"; 
     @m_error_array = <ERROR_W>; 
     @m_href_array = <M_HREF_W>; 
     @href_array = <HREF_W>; 
     close (ERROR_W); 
     close (M_HREF_W); 
     close (HREF_W); 
    }else{ 
     @href_array = ($self->{_url}); 
    } 
    my $z = 0; 
    while(@href_array){ 
     if(defined($self->{_x_more}) && $z == $self->{_x_more}) { 
      last; 
     } 
     if(defined($self->{_process_image})) { 
      $self->write('m_href_w', @m_href_array); 
      $self->write('href_w', @href_array); 
      $self->write('error_w', @m_error_array); 
     } 
     $self->{_link_count} = scalar @m_href_array; 
     my $href = shift(@href_array); 
     my $info = URI::http->new($href); 
     my $host = $info->host; 
     $host =~ s/^www\.//; 
     $result = $dbh->prepare("INSERT INTO `". $host ."` (URL) VALUES ('$href')"); 
     if(! $result->execute()){ 
      $result = $dbh->prepare("CREATE TABLE `" . $host . "` (`ID` INT(255) NOT NULL AUTO_INCREMENT , `URL` VARCHAR(255) NOT NULL , PRIMARY KEY (`ID`)) ENGINE = MYISAM ;"); 
      $result->execute(); 
     } 
     $self->{_current_page} = $href; 
     my $response = $ua->get($href); 
     my $responseCode = $response->code; 
     if($responseCode ne "404") { 
      my $content = LWP::Simple->get($href); 
      die "get failed: " . $href if (!defined $content); 
     } 
     #print $responseCode; 
    } 
} 
1; 

#$query = "SELECT * FROM `actwebdesigns.co.uk` ORDER BY ID DESC"; 
#$result = $dbh->prepare($query); 
#$result->execute(); 
#while($array = $result->fetchrow_hashref()) { 
# print $array->{'URL'} . "\n"; 
#} 

編輯:

仍然無法處理重定向。

my $redirect_limit = 10; 
    my $y = 0; 
    while(1 && $y le $redirect_limit) { 
     my $response = $ua->get($href); 
     my $responseCode = $response->code; 
     if($responseCode == 200 || $responseCode == 301 || $responseCode == 302) { 
      if($responseCode == 301 || $responseCode == 302) { 
       $href = $response->header('Location'); 
      }else{ 
       last; 
      } 
     }else{ 
      push(@m_error_array, $href); 
      last; 
     } 
     $y++; 
    } 
    if($y ne $redirect_limit) { 
     if(! defined($self->{_url_list})) { 
      my @url_list = ($href); 
     }else{ 
      my @url_list = $self->{_url_list}; 
      push(@url_list, $href); 
      $self->{_url_list} = @url_list; 
     } 
     my $content = LWP::Simple->get($href); 
     die "get failed: " . $href if (!defined $content); 

     #$result = $dbh->prepare("INSERT INTO `". $host ."` (URL) VALUES ('$href')"); 
     #if(! $result->execute()){ 
     # $result = $dbh->prepare("CREATE TABLE `" . $host . "` (`ID` INT(255) NOT NULL AUTO_INCREMENT , `URL` VARCHAR(255) NOT NULL , PRIMARY KEY (`ID`)) ENGINE = MYISAM ;"); 
     # $result->execute(); 
     #} 
     print "good"; 
    }else{ 
     push(@m_error_array, $href); 
    } 
+1

你爲什麼要抓取的網頁兩次?使用'$ response-> content'而不是'LWP :: Simple-> get($ href)''。 – cjm 2010-09-21 07:37:27

+0

謝謝你,對錢! – 2010-09-21 07:56:29

+2

如果遇到問題,請將其縮小爲演示問題的最短示例腳本。在你的程序中刪除其他所有內容。找出哪些數據觸發問題。通常,通過這個過程,你會發現你的答案 – 2010-09-21 11:55:12

回答

2

您應該檢查響應代碼以查看發生了什麼(您已在檢查404)。我收到一個302 - 重定向。

例如:

die "get failed ($responseCode): " . $href if (!defined $content); 

結果消息:

get failed (302): http://www.google.com at goog.pl line 20. 
+0

哈,這是一個什麼東西!受信任的google.com本地化爲.co.uk ...應該知道這一點。問候! – 2010-09-21 06:24:44

+0

仍然無法正常工作,請參閱上述修正。 – 2010-09-21 07:03:21

2

一對夫婦的想法。

1 /您似乎在使用字符串比較運算符(le,ne)來比較數字。您應該使用數字比較運算符(< =,!=)代替。

2您從LWP::UserAgent::get調用返回的值是HTTP::Response對象。明智地使用該類的「is_foo」方法可能會使您的代碼更清晰一些。

我不知道這些是否能解決您的問題。但他們會提高你的代碼的質量。

1

這是你的問題:

my $content = LWP::Simple->get($href); 

這將字符串「LWP ::簡單」作爲第一個參數「得到」。你想:

my $content = LWP::Simple::get($href); 
-1

檢查你的SELinux設置。

已啓用SELINUX的系統將不允許來自Web代理(httpd)的傳出連接。

此頁面可以告訴你更多關於SELinux和HTTPD設置: http://wiki.centos.org/TipsAndTricks/SelinuxBooleans

在Perl腳本啓動Apache的出站Web連接:

# setsebool -P httpd_can_network_connect on 
+0

如果您正在回答一個問題,請評論原因。 – 2013-02-12 22:33:39

相關問題