2011-02-25

Good evening, dear community! LWP::Simple: how do I implement a loop with it? [live demo]

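I want to process multiple web pages, somewhat like a web spider/crawler would. I have some pieces in place, but now I need improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50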

This page returns more than 6,000 results! So how do I get all of them? I am using the LWP::Simple module, and I need improved arguments that let me fetch all 6,150 records.

Attempt: here are the URLs of the first 5 pages:

http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200 

We can see that the "s" parameter in the URL starts at 0 for page 1 and then increases by 50 per page. We can use this information to build a loop:

my $i_first = "0"; 
my $i_last = "6100"; 
my $i_interval = "50"; 

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
    my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
    #process pageurl 
} 
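The end value of 6100 follows from the record count: with 6,150 records at 50 per page, the last page starts at offset 6100. A minimal sketch of that arithmetic, in case the total changes; `last_offset` and `page_offsets` are hypothetical helper names chosen for illustration, not part of LWP::Simple:

```perl
use strict;
use warnings;

# Hypothetical helper: offset of the last page for a given record count.
sub last_offset {
    my ($total, $per_page) = @_;
    return 0 if $total <= 0;
    return int(($total - 1) / $per_page) * $per_page;
}

# Hypothetical helper: every "s" offset from 0 up to the last page.
sub page_offsets {
    my ($total, $per_page) = @_;
    my $last = last_offset($total, $per_page);
    return map { $_ * $per_page } 0 .. $last / $per_page;
}

my @offsets = page_offsets(6150, 50);
printf "%d pages, last offset %d\n", scalar @offsets, $offsets[-1];
```

For 6,150 records this gives 123 pages, which matches a loop running from 0 to 6100 in steps of 50.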

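tadmc (a very, very helpful user) created a great script that produces CSV-formatted results. I have built the loop into that code. (Note: I guess something has gone wrong; see my thoughts below, with the code snippet and the error messages.)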

#!/usr/bin/perl 
use warnings; 
use strict; 
use LWP::Simple; 
use HTML::TableExtract; 
use Text::CSV; 

my $i_first = "0"; 
my $i_last = "6100"; 
my $i_interval = "50"; 

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
    my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
    #process pageurl 
} 

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50'; 
$html =~ tr/r//d;  # strip the carriage returns 
$html =~ s/&nbsp;/ /g; # expand the spaces 

my $te = new HTML::TableExtract(); 
$te->parse($html); 

my @cols = qw(
    rownum 
    number 
    name 
    phone 
    type 
    website 
); 

my @fields = qw(
    rownum 
    number 
    name 
    street 
    postal 
    town 
    phone 
    fax 
    type 
    website 
); 

my $csv = Text::CSV->new({ binary => 1 }); 

foreach my $ts ($te->table_states) { 
    foreach my $row ($ts->rows) { 

trim leading/trailing whitespace from base fields 
     s/^s+//, s/\s+$// for @$row; 

load the fields into the hash using a "hash slice" 
     my %h; 
     @h{@cols} = @$row; 

derive some fields from base fields, again using a hash slice 
     @h{qw/name street postal town/} = split /n+/, $h{name}; 
     @h{qw/phone fax/} = split /n+/, $h{phone}; 

trim leading/trailing whitespace from derived fields 
     s/^s+//, s/\s+$// for @h{qw/name street postal town/}; 

     $csv->combine(@h{@fields}); 
     print $csv->string, "\n"; 
    } 
} 

There are already some problems. I made a mistake, and I think the error is here:

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
     #process pageurl 
    } 

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50'; 
$html =~ tr/r//d;  # strip the carriage returns 
$html =~ s/&nbsp;/ /g; # expand the spaces 

I have written some kind of duplicated code, and I need to leave out one part... this one here:

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50'; 
$html =~ tr/r//d;  # strip the carriage returns 
$html =~ s/&nbsp;/ /g; # expand the spaces 

See the result on the command line:

[email protected]:~> cd perl 
[email protected]:~/perl> perl bavaria_all_.pl 
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52. 
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52. 
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52. 
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52. 
syntax error at bavaria_all_.pl line 59, near "/," 
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 59. 
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60. 
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60. 
Substitution replacement not terminated at bavaria_all_.pl line 63. 
[email protected]:~/perl> 

What do you think? I look forward to hearing from you.

By the way, here is the code created by tadmc, without any improved spider logic. It runs very nicely, without any problems, and produces nicely formatted CSV output!

#!/usr/bin/perl 
use warnings; 
use strict; 
use LWP::Simple; 
use HTML::TableExtract; 
use Text::CSV; 

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50'; 
$html =~ tr/\r//d;  # strip the carriage returns 
$html =~ s/&nbsp;/ /g; # expand the spaces 

my $te = HTML::TableExtract->new(); 
$te->parse($html); 

my @cols = qw(
    rownum 
    number 
    name 
    phone 
    type 
    website 
); 

my @fields = qw(
    rownum 
    number 
    name 
    street 
    postal 
    town 
    phone 
    fax 
    type 
    website 
); 

my $csv = Text::CSV->new({ binary => 1 }); 

foreach my $ts ($te->table_states) { 
    foreach my $row ($ts->rows) { 

        # trim leading/trailing whitespace from base fields 
        s/^\s+//, s/\s+$// for @$row; 

        # load the fields into the hash using a "hash slice" 
        my %h; 
        @h{@cols} = @$row; 

        # derive some fields from base fields, again using a hash slice 
        @h{qw/name street postal town/} = split /\n+/, $h{name}; 
        @h{qw/phone fax/} = split /\n+/, $h{phone}; 

        # trim leading/trailing whitespace from derived fields 
        s/^\s+//, s/\s+$// for @h{qw/name street postal town/}; 

        $csv->combine(@h{@fields}); 
        print $csv->string, "\n"; 
    } 
} 

Note: the code above runs fine and produces CSV-formatted output.

+1

Do you expect us to work out which line is line 52? I'm too busy for that... – tadmc 2011-02-26 00:14:08

+0

Hello dear tadmc, no, I don't! Here is line 52: s/^s+//, s/\s+$// for @$row; – zero 2011-02-26 00:18:47

+1

It is missing a backslash. The errors occur because you removed the comment markers (#), so the comments are being interpreted as Perl code... – tadmc 2011-02-26 02:24:23

Answers

1

Another way to implement pagination is to extract all URLs from the page and detect the pager URLs.

... 
for (@urls) { 
    if (is_pager_url($_) and not exists $seen{$_}) { 
     push @pager_url, $_; 
     $seen{$_}++; 
    } 
} 
... 

sub is_pager_url { 
    my ($url) = @_; 
    return 1 if $url =~ m{schulsuche\.asp\?q=e&a=\d+&s=\d+}; 
} 

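This way you don't have to deal with incrementing a counter or determining the total number of pages. It also works for different values of a and s. By keeping a %seen hash, you can cheaply avoid having to distinguish between previous-page and next-page links.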
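The answer leaves out how @urls is obtained. One possible way to fill that gap is sketched below, run against made-up sample markup rather than the real site. The `new_pager_urls` helper, the regex-based link extraction, and the sample HTML are all assumptions for illustration; a real crawler would probably use a proper HTML parser.

```perl
use strict;
use warnings;

# Same test as in the answer: does this link point at another result page?
sub is_pager_url {
    my ($url) = @_;
    return $url =~ m{schulsuche\.asp\?q=e&a=\d+&s=\d+} ? 1 : 0;
}

# Hypothetical helper: collect not-yet-seen pager links from one page of HTML.
sub new_pager_urls {
    my ($html, $seen) = @_;
    my @found;
    while ($html =~ /href\s*=\s*"([^"]+)"/gi) {
        my $url = $1;
        $url =~ s/&amp;/&/g;        # undo entity escaping inside attributes
        next unless is_pager_url($url);
        next if $seen->{$url}++;    # already queued or visited
        push @found, $url;
    }
    return @found;
}

# Made-up sample markup standing in for a fetched result page.
my $sample = <<'HTML';
<a href="schulsuche.asp?q=e&amp;a=50&amp;s=0">prev</a>
<a href="schulsuche.asp?q=e&amp;a=50&amp;s=100">next</a>
<a href="schulsuche.asp?q=e&amp;a=50&amp;s=100">3</a>
<a href="impressum.asp">about</a>
HTML

my %seen;
my @queue = new_pager_urls($sample, \%seen);
print "$_\n" for @queue;
```

Because %seen is shared across calls, running the helper on each newly fetched page only ever adds links that have not been queued before, so the duplicate "next" and "3" links above count once.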

+0

Hello Vipul Ved Prakash, this looks very, very impressive. I'll try it later today! Many thanks for posting! – zero 2011-02-26 00:49:50

+0

I'm trying to figure out how to apply this approach... it looks interesting and fascinating! – zero 2011-02-26 01:06:07

1

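Very good! I've been waiting for you to work out how to fetch multiple pages on your own!

1) Put my code inside the page-fetching loop (move the "}" way down to the end).

2) $html = get $pageurl; # change this to use your new URL

3) Put my backslash back where it was: tr/\r//d;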
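A rough sketch of how the three steps might fit together is shown here. It is not tadmc's actual script: the `clean_html` and `fetch_all` names and the `--fetch` switch are assumptions added for illustration, and the table-extraction part is left as a placeholder comment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Step 3: the per-page clean-up, with the backslash restored in tr/\r//d.
sub clean_html {
    my ($html) = @_;
    $html =~ tr/\r//d;          # strip the carriage returns
    $html =~ s/&nbsp;/ /g;      # expand the spaces
    return $html;
}

# Steps 1 and 2: everything that handles a page lives inside the loop,
# and get() is called with the per-page URL instead of a fixed one.
sub fetch_all {
    require LWP::Simple;        # loaded lazily so clean_html works offline
    for (my $i = 0; $i <= 6100; $i += 50) {
        my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
        my $html = LWP::Simple::get($pageurl);
        unless (defined $html) {
            warn "could not fetch $pageurl\n";
            next;
        }
        $html = clean_html($html);
        # ... the HTML::TableExtract / Text::CSV code from the question goes here ...
        printf "%s: %d characters\n", $pageurl, length $html;
    }
}

# Assumption: a --fetch switch (not in the original script) guards the network run.
fetch_all() if @ARGV && $ARGV[0] eq '--fetch';
```

Checking the return value of get matters here because LWP::Simple's get returns undef on failure, and one failed page out of 123 should not abort the whole run.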

+0

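Hello dear tadmc! Great to hear from you!! I'm very glad! As you can see, I'm having some trouble. I think I have to add some comments. Is that what you want from me? I'll now read your points 1 to 3 again and try to understand them in detail... – zero 2011-02-26 00:06:59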