下面的代碼是一個測試,以測試我已經用我新發現的線程知識所做的事情。爲什麼我的腳本沒有處理數組中的所有元素?
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use URI;
use URI::http;
use File::Basename;
use DBI;
use HTML::Parser;
use LWP::Simple;
require LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
$ua->max_redirect(0);
print "Starting main program\n";
my @urls = ('http://www.actwebdesigns.co.uk', 'http://www.1st4pets.com', 'http://www.special4you.com');
my @threads;
while (@urls) {
my $url = shift (@urls);
my $t = threads->new(\&scan, $url);
push(@threads,$t);
}
while (@threads) {
my $url_thread = shift(@threads)->join;
}
sub resolve_href {
my ($base, $href) = @_;
my $u = URI->new_abs($href, $base);
return $u->canonical;
}
sub redirect_test {
my $url = shift;
my $redirect_limit = 10;
my $y = 0;
my($response, $responseCode);
while(1 && $y le $redirect_limit) {
$response = $ua->get($url);
$responseCode = $response->code;
if($responseCode == 200 || $responseCode == 301 || $responseCode == 302) {
if($responseCode == 301 || $responseCode == 302) {
$url = resolve_href($url, $response->header('Location'));
}else{
last;
}
}else{
last;
}
$y++;
}
return ($url, $response, $responseCode, $redirect_limit, $y);
}
sub scan {
my $url = shift;
my @hrefs_found;
print "started scanning: $url\n";
my $info = URI::http->new($url);
# if url is not an absolute url
if(! defined($info->host)) {
print "Invalid URL: $url \n";
}else{
my $host = $info->host;
$host =~ s/^www\.//;
# check to see if url is valid, checks for redirects (max of 10)
my @urlI = redirect_test($url);
my $content = '';
# checks to see if url did not redirect more than 10 times and that response returned was 200
if($urlI[4] != $urlI[3] && $urlI[2] == 200) {
$content = $urlI[1]->content;
die "get failed: " . $urlI[0] if (! defined $content);
}
# sticks all hrefs on a page in an array
my @pageLinksArray = ($content =~ m/href=["']([^"']*)["']/g);
# foreach links found
foreach(@pageLinksArray) {
# make href an absolute url
my $url_found = resolve_href($urlI[0], $_);
# check if url looks like a valid url
if($url_found =~ m/^http:\/\//) {
my $info = URI::http->new($url_found);
# check to see if url is a valid url
if(! defined($info->host)) {
print "Invalid URL: $url_found \n";
}else{
my %values_index;
@values_index{@hrefs_found} =();
my %values_index2;
@values_index2{@urls} =();
# if url is not already been found
if(! exists $values_index{$url_found} && ! exists $values_index2{$url_found}) {
# add to arrays
push(@hrefs_found, $url_found);
push(@urls, $url_found);
}
}
}
}
print "$url found " . scalar @hrefs_found . "\n";
}
return $url;
}
的問題是,靠近該腳本的新發現的網址被添加到陣列中,但在腳本的頂部代碼的端部不處理它們,即它僅通過第一測試網址去。
任何人都可以看到爲什麼會發生這種情況嗎?
問候,
菲爾
編輯* *
我試圖做這樣的事情暫停:
while (@urls) {
my $url = shift (@urls);
my $t = threads->new(\&scan, $url);
push(@threads,$t);
my $n = 0;
while(1) {
if(scalar @urls == 1) {
sleep 10;
}else{
last;
}
if($n >= 1) {
print "IN ARRAY URLS:\n\n";
print @urls;
print "\n\n";
die "Process taking too long.";
last;
}
$n++;
}
}
但它似乎沒有做任何事情。
的結果是:
Starting main program
started scanning: http://www.actwebdesigns.co.uk
started scanning: http://www.1st4pets.com
http://www.actwebdesigns.co.uk found 24
http://www.1st4pets.com found 17
IN ARRAY URLS:
http://www.stackoverflow.com
Process taking too long. at C:\perlscripts\thread.pl line 38.
Perl exited with active threads:
0 running and unjoined
2 finished and unjoined
0 running and detached
凹凸,任何人? – 2010-09-24 11:34:54
如果您將問題簡化爲能夠演示問題的最小腳本,則幫助您更容易。大多數人不會想要通過你的漫長而複雜的腳本去處理各種各樣的事情。如果你提出更好的問題,更多的人會對幫助你感興趣。 – 2010-09-24 15:45:23