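Good evening, dear community! LWP::Simple - how to implement a loop with it [live demo]
I want to process multiple web pages, somewhat like a web spider/crawler would. I already have some pieces, but now I need improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50
This page returns more than 6000 results! So how do I get all of them? I use the module LWP::Simple, and I need some improved arguments that I can use to fetch all 6150 records.
Attempt: here are the URLs of the first 5 pages: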
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200
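We can see that the "s" parameter in the URL starts at 0 for page 1 and then increases by 50 for each page. We can use this information to create a loop: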
my $i_first    = 0;
my $i_last     = 6100;
my $i_interval = 50;
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
    my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
    # process $pageurl here
}
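As a small illustration of the offset arithmetic, here is a minimal sketch that only builds the list of page URLs and fetches nothing. The helper name `page_urls` and its argument order are hypothetical, chosen for this example; the base URL and the 0 to 6100 range in steps of 50 come from the post.

```perl
use strict;
use warnings;

# Hypothetical helper: build one URL per result page by stepping the
# "s" offset from $first to $last (inclusive) in increments of $step.
sub page_urls {
    my ($base, $first, $last, $step) = @_;
    die "step must be positive\n" unless $step > 0;   # avoid an endless loop
    my @urls;
    for (my $s = $first; $s <= $last; $s += $step) {
        push @urls, "$base&s=$s";
    }
    return @urls;
}
```

Called as `page_urls('http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50', 0, 6100, 50)`, this yields 123 URLs, since 6100 / 50 + 1 = 123.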
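tadmc (a very, very supportive user) created a great script that outputs CSV-formatted results. I built the loop into this code. (Note: I guess something went wrong here; see my thoughts below, together with the code snippets and the error messages.)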
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my $i_first = "0";
my $i_last = "6100";
my $i_interval = "50";
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
#process pageurl
}
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d; # strip the carriage returns
$html =~ s/ / /g; # expand the spaces
my $te = new HTML::TableExtract();
$te->parse($html);
my @cols = qw(
rownum
number
name
phone
type
website
);
my @fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
my $csv = Text::CSV->new({ binary => 1 });
foreach my $ts ($te->table_states) {
foreach my $row ($ts->rows) {
trim leading/trailing whitespace from base fields
s/^s+//, s/\s+$// for @$row;
load the fields into the hash using a "hash slice"
my %h;
@h{@cols} = @$row;
derive some fields from base fields, again using a hash slice
@h{qw/name street postal town/} = split /n+/, $h{name};
@h{qw/phone fax/} = split /n+/, $h{phone};
trim leading/trailing whitespace from derived fields
s/^s+//, s/\s+$// for @h{qw/name street postal town/};
$csv->combine(@h{@fields});
print $csv->string, "\n";
}
}
There are already some problems. I made a mistake, and I think the error is here:
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
#process pageurl
}
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d; # strip the carriage returns
$html =~ s/ / /g; # expand the spaces
I wrote some kind of duplicated code, and I need to leave one part out... this one here:
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d; # strip the carriage returns
$html =~ s/ / /g; # expand the spaces
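One possible way to remove the duplication is to move the fetch inside the loop, so that each computed page URL is actually downloaded instead of the single hard-coded one. The following is only a sketch under that assumption: `for_each_page` and its named arguments (`base`, `last`, `step`, `fetch`, `handler`) are hypothetical names invented for this example. The only library call used is `get` from LWP::Simple, which returns `undef` on failure.

```perl
use strict;
use warnings;

# Hypothetical helper: fetch every result page and pass its HTML to a handler.
# The fetcher is injectable so the loop can be tried without network access;
# when none is given, LWP::Simple::get is loaded and used.
sub for_each_page {
    my (%arg) = @_;
    my $fetch = $arg{fetch} || do {
        require LWP::Simple;
        \&LWP::Simple::get;
    };
    my $processed = 0;
    for (my $s = 0; $s <= $arg{last}; $s += $arg{step}) {
        my $url  = "$arg{base}&s=$s";
        my $html = $fetch->($url);
        unless (defined $html) {
            warn "could not fetch $url\n";   # skip this page, keep going
            next;
        }
        $arg{handler}->($html, $url);        # e.g. parse the table and print CSV
        $processed++;
    }
    return $processed;                       # number of pages handled
}

# Possible usage (not executed here):
# for_each_page(
#     base    => 'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50',
#     last    => 6100,
#     step    => 50,
#     handler => sub { my ($html) = @_; ... },
# );
```

The table parsing and CSV output would then live in the handler, with a fresh HTML::TableExtract object created for each page.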
See the results on the command line:
[email protected]:~> cd perl
[email protected]:~/perl> perl bavaria_all_.pl
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
syntax error at bavaria_all_.pl line 59, near "/,"
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 59.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
Substitution replacement not terminated at bavaria_all_.pl line 63.
[email protected]:~/perl>
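What do you think? I look forward to hearing from you.
By the way, here is the code created by tadmc, without any improved spider logic. It runs very nicely, without any problems, and it outputs nicely formatted CSV: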
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my $html = get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/\r//d;       # strip the carriage returns
$html =~ s/&nbsp;/ /g;   # expand the spaces
my $te = new HTML::TableExtract();
$te->parse($html);
my @cols = qw(
    rownum
    number
    name
    phone
    type
    website
);
my @fields = qw(
    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website
);
my $csv = Text::CSV->new({ binary => 1 });
foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {
        # trim leading/trailing whitespace from base fields
        s/^\s+//, s/\s+$// for @$row;
        # load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;
        # derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /\n+/, $h{name};
        @h{qw/phone fax/} = split /\n+/, $h{phone};
        # trim leading/trailing whitespace from derived fields
        s/^\s+//, s/\s+$// for @h{qw/name street postal town/};
        $csv->combine(@h{@fields});
        print $csv->string, "\n";
    }
}
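Note: the code above runs very well and outputs CSV-formatted results.
Do you expect us to figure out which line is line 52? I'm too busy... – tadmc 2011-02-26 00:14:08
Hello dear tadmc - no, I don't! Here is line 52: s/^s+//, s/\s+$// for @$row; – zero 2011-02-26 00:18:47
It is missing a backslash. The errors occur because you removed the comment markers (#), so the comments are being interpreted as Perl code... – tadmc 2011-02-26 02:24:23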
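To make the "hash slice" step in that script easier to follow, here is a minimal standalone sketch of how one multi-line table cell is split into several named fields. The helper name `derive_fields` and the sample values in the usage note are hypothetical; the field names and the split on newlines follow the script.

```perl
use strict;
use warnings;

# Hypothetical helper: take a hash ref whose "name" and "phone" cells hold
# several newline-separated values and spread them over separate fields.
sub derive_fields {
    my ($row) = @_;
    my %h = %$row;                                   # work on a copy
    # a hash slice assigns several keys from one list in a single statement
    @h{qw/name street postal town/} = split /\n+/, $h{name};
    @h{qw/phone fax/}               = split /\n+/, $h{phone};
    # trim leading/trailing whitespace from the derived fields
    for (@h{qw/name street postal town phone fax/}) {
        next unless defined;
        s/^\s+//;
        s/\s+$//;
    }
    return \%h;
}
```

For example, a `name` cell of `"Example School\n Main Street 1\n12345\nExample Town"` (made-up data) would end up as four separate fields, with the leading space removed from the street.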
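The effect of the missing backslash can be shown in isolation. In this small sketch (both function names are hypothetical and exist only for the comparison), `\s` matches whitespace, while a bare `s` matches the literal letter "s":

```perl
use strict;
use warnings;

# Intended version: \s+ strips leading and trailing whitespace.
sub trim_correct {
    my ($text) = @_;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    return $text;
}

# Version without the backslash in the first pattern: s+ strips
# leading letters "s" and leaves leading whitespace untouched.
sub trim_missing_backslash {
    my ($text) = @_;
    $text =~ s/^s+//;
    $text =~ s/\s+$//;
    return $text;
}
```

With the input `"  school \n"`, the first function returns `"school"` and the second returns `"  school"`. With the input `"sschool"`, the second function returns `"chool"`.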