2011-02-25
0

Good evening, dear community! LWP::Simple - how do I implement a loop into it [live demo]

I want to process multiple web pages, somewhat like a web spider/crawler would. I have some pieces in place - but now I need some improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

This page has more than 6000 results! So how do I get all of them? I am using the module LWP::Simple, and I need some improved arguments so that I can fetch all 6150 records.

Attempt: here are the URLs of the first 5 pages:

http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200 

We can see that the "s" parameter in the URL starts at 0 for page 1 and then increases by 50 for each page. We can use this information to create a loop:

my $i_first = "0"; 
my $i_last = "6100"; 
my $i_interval = "50"; 

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
    my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
    #process pageurl 
} 

tadmc (a very, very supportive user) created a great script that outputs CSV-formatted results. I have built the loop into that code. (Note - I guess something has gone wrong; see my musings below, along with the code segment and the error messages:)

#!/usr/bin/perl 
use warnings; 
use strict; 
use LWP::Simple; 
use HTML::TableExtract; 
use Text::CSV; 

my $i_first = "0"; 
my $i_last = "6100"; 
my $i_interval = "50"; 

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
    my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
    #process pageurl 
} 

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50'; 
$html =~ tr/r//d;  # strip the carriage returns 
$html =~ s/&nbsp;/ /g; # expand the spaces 

my $te = new HTML::TableExtract(); 
$te->parse($html); 

my @cols = qw(
    rownum 
    number 
    name 
    phone 
    type 
    website 
); 

my @fields = qw(
    rownum 
    number 
    name 
    street 
    postal 
    town 
    phone 
    fax 
    type 
    website 
); 

my $csv = Text::CSV->new({ binary => 1 }); 

foreach my $ts ($te->table_states) { 
    foreach my $row ($ts->rows) { 

trim leading/trailing whitespace from base fields 
     s/^s+//, s/\s+$// for @$row; 

load the fields into the hash using a "hash slice" 
     my %h; 
     @h{@cols} = @$row; 

derive some fields from base fields, again using a hash slice 
     @h{qw/name street postal town/} = split /n+/, $h{name}; 
     @h{qw/phone fax/} = split /n+/, $h{phone}; 

trim leading/trailing whitespace from derived fields 
     s/^s+//, s/\s+$// for @h{qw/name street postal town/}; 

     $csv->combine(@h{@fields}); 
     print $csv->string, "\n"; 
    } 
} 

There are already some problems - I have made a mistake. I guess the mistake is here:

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
    my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
    #process pageurl 
} 

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50'; 
$html =~ tr/r//d;  # strip the carriage returns 
$html =~ s/&nbsp;/ /g; # expand the spaces 

I have ended up with some kind of duplicated code - I need to leave one part out... this one here:

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50'; 
$html =~ tr/r//d;  # strip the carriage returns 
$html =~ s/&nbsp;/ /g; # expand the spaces 

See the results on the command line:

[email protected]:~> cd perl 
[email protected]:~/perl> perl bavaria_all_.pl 
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52. 
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52. 
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52. 
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52. 
syntax error at bavaria_all_.pl line 59, near "/," 
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 59. 
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60. 
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60. 
Substitution replacement not terminated at bavaria_all_.pl line 63. 
[email protected]:~/perl> 

What do you think? Looking forward to hearing from you!

By the way - see the code created by tadmc below, without any improved spider logic... it runs very, very nicely - without any problems: it spits out a nicely formatted CSV output!

#!/usr/bin/perl 
use warnings; 
use strict; 
use LWP::Simple; 
use HTML::TableExtract; 
use Text::CSV; 

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50'; 
$html =~ tr/\r//d;  # strip the carriage returns 
$html =~ s/&nbsp;/ /g; # expand the spaces 

my $te = new HTML::TableExtract(); 
$te->parse($html); 

my @cols = qw(
    rownum 
    number 
    name 
    phone 
    type 
    website 
); 

my @fields = qw(
    rownum 
    number 
    name 
    street 
    postal 
    town 
    phone 
    fax 
    type 
    website 
); 

my $csv = Text::CSV->new({ binary => 1 }); 

foreach my $ts ($te->table_states) { 
    foreach my $row ($ts->rows) { 

     # trim leading/trailing whitespace from base fields 
     s/^\s+//, s/\s+$// for @$row; 

     # load the fields into the hash using a "hash slice" 
     my %h; 
     @h{@cols} = @$row; 

     # derive some fields from base fields, again using a hash slice 
     @h{qw/name street postal town/} = split /\n+/, $h{name}; 
     @h{qw/phone fax/} = split /\n+/, $h{phone}; 

     # trim leading/trailing whitespace from derived fields 
     s/^\s+//, s/\s+$// for @h{qw/name street postal town/}; 

     $csv->combine(@h{@fields}); 
     print $csv->string, "\n"; 
    } 
} 

Note: the code above runs very well - it spits out CSV-formatted output.

+1

Do you expect us to be able to figure out which line is line 52? I'm too busy for that... – tadmc 2011-02-26 00:14:08

+0

Hello dear tadmc - no, I don't! Here is line 52: s/^s+//, s/\s+$// for @$row; – zero 2011-02-26 00:18:47

+1

It's missing a backslash. The errors come about because you removed the comment markers (#), so the comments are being interpreted as Perl code... – tadmc 2011-02-26 02:24:23
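
For reference, this is presumably how that part of the script reads once the comment marker and the backslashes are restored (inferred from tadmc's diagnosis, not copied from the original post):

     # trim leading/trailing whitespace from base fields 
     s/^\s+//, s/\s+$// for @$row; 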

Answers

1

Another way to implement pagination is to extract all URLs from the page and detect the pager URLs.

... 
for (@urls) { 
    if (is_pager_url($_) and not exists $seen{$_}) { 
     push @pager_url, $_; 
     $seen{$_}++; 
    } 
} 
... 

sub is_pager_url { 
    my ($url) = @_; 
    return 1 if $url =~ m{schulsuche.asp\?q=e\&a=\d+\&s=\d+}; 
} 

This way you don't have to deal with incrementing a counter or establishing the total number of pages. It also works for different values of a and s. By keeping a %seen hash, you can cheaply avoid having to differentiate between prev and next pages.
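
Here is a minimal sketch of how this approach might be wired together. The is_pager_url helper and the %seen idea come from the answer above; the use of HTML::LinkExtor, the work queue, and the starting URL are assumptions added for illustration:

#!/usr/bin/perl 
use warnings; 
use strict; 
use LWP::Simple; 
use HTML::LinkExtor; 
use URI; 

my $start = 'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0'; 

my %seen  = ($start => 1);   # pager URLs already queued 
my @queue = ($start);        # pages still to fetch 

while (my $pageurl = shift @queue) { 
    my $html = get $pageurl; 
    next unless defined $html;   # skip pages that could not be fetched 

    # ... process $html here, e.g. feed it to HTML::TableExtract as in the main script ... 

    # collect every link on the page, made absolute against the current page 
    my @urls; 
    my $extor = HTML::LinkExtor->new(sub { 
        my ($tag, %attr) = @_; 
        push @urls, URI->new_abs($attr{href}, $pageurl) if $tag eq 'a' and $attr{href}; 
    }); 
    $extor->parse($html); 

    # queue every pager URL we have not seen before 
    for (@urls) { 
        if (is_pager_url($_) and not exists $seen{$_}) { 
            push @queue, "$_"; 
            $seen{$_}++; 
        } 
    } 
} 

sub is_pager_url { 
    my ($url) = @_; 
    return 1 if $url =~ m{schulsuche\.asp\?q=e&a=\d+&s=\d+}; 
    return 0; 
} 

The queue makes the crawl order irrelevant, and the %seen hash guarantees that each results page is queued only once, no matter how many prev/next links point to it.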

+0

Hello Vipul Ved Prakash - this looks very, very impressive. I will try this out later today! Many thanks for posting! – zero 2011-02-26 00:49:50

+0

I am trying to figure out how to apply this approach... it looks interesting and fascinating! – zero 2011-02-26 01:06:07

1

Very nice! I have been waiting for you to work out how to fetch the multiple pages on your own!

1) Put my code inside your page-fetching loop (move the "}" way down to the end).

2) $html = get $pageurl; # change this to use your new URL.

3) Put my backslashes back where I had them: tr/\r//d; (a sketch combining these three steps is shown below).
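
Putting the three steps together, the merged script would presumably look something like the following (a sketch of the suggested merge, not tadmc's exact code; the URL, the column lists and the cleanup logic are taken from the scripts above):

#!/usr/bin/perl 
use warnings; 
use strict; 
use LWP::Simple; 
use HTML::TableExtract; 
use Text::CSV; 

my @cols   = qw(rownum number name phone type website); 
my @fields = qw(rownum number name street postal town phone fax type website); 
my $csv    = Text::CSV->new({ binary => 1 }); 

for (my $i = 0; $i <= 6100; $i += 50) { 
    my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 

    my $html = get $pageurl;     # step 2): fetch the page for the current loop value 
    next unless defined $html;   # skip pages that could not be fetched 

    $html =~ tr/\r//d;           # step 3): backslash restored - strip the carriage returns 
    $html =~ s/&nbsp;/ /g;       # expand the non-breaking spaces 

    my $te = HTML::TableExtract->new(); 
    $te->parse($html); 

    foreach my $ts ($te->table_states) { 
        foreach my $row ($ts->rows) { 
            # trim leading/trailing whitespace from base fields 
            s/^\s+//, s/\s+$// for @$row; 

            # load the fields into the hash using a "hash slice" 
            my %h; 
            @h{@cols} = @$row; 

            # derive some fields from base fields 
            @h{qw/name street postal town/} = split /\n+/, $h{name}; 
            @h{qw/phone fax/}               = split /\n+/, $h{phone}; 

            # trim leading/trailing whitespace from derived fields 
            s/^\s+//, s/\s+$// for @h{qw/name street postal town/}; 

            $csv->combine(@h{@fields}); 
            print $csv->string, "\n"; 
        } 
    } 
}   # step 1): the per-page processing now lives inside the loop, so the "}" moves to the very end 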

+0

Hello dear tadmc! Great to hear from you!! I am very glad! - as you can see, I have run into some trouble. I think I have to add some comments. Is that what you want from me? I am now trying to read your points 1. to 3. again and to understand them in detail... – zero 2011-02-26 00:06:59