2012-02-21

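Perl WWW::Mechanize / HTML::TokeParser: following and storing URLs from href attributes

Thanks to the help on this site I have made some progress with Perl, but I have hit a problem. One of the pages I scrape has changed, and I can't work out how to handle it now. What I want to do is store a link to each page I need to visit. The problem is that these links sit inside href attributes in the page source, and I don't know how to extract them. Can anyone help?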

The links I need are on lines 316 to 354 of this page's source: http://www.soccerbase.com/teams/home.sd

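Basically, I need to extract the links into variables that I use in my other scripts. As mentioned, I am using WWW::Mechanize and HTML::TokeParser and would like to keep using them, but so far I can't work it out. Thanks in advance!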

Answer


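See the find_all_links method in WWW::Mechanize. There is no need to drive a parser by hand. You will probably want to loosen the regex so that you get all of the roughly 1000 possible teams in one go.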

use strict;
use warnings;
use WWW::Mechanize qw();

my $w = WWW::Mechanize->new;
$w->get('http://www.soccerbase.com/teams/home.sd');

# find_all_links returns WWW::Mechanize::Link objects whose URL matches the regex
for my $link ($w->find_all_links(url_regex => qr/comp_id=1\b/)) {
    # url_abs and text are methods of the link object, not of $w
    printf "URL=%s\tTeam=%s\n", $link->url_abs, $link->text;
}

URL=http://www.soccerbase.com/tournaments/tournament.sd?comp_id=1  Team=Premier League 
URL=http://www.soccerbase.com/teams/team.sd?team_id=142&comp_id=1  Team=Arsenal 
URL=http://www.soccerbase.com/teams/team.sd?team_id=154&comp_id=1  Team=Aston Villa 
URL=http://www.soccerbase.com/teams/team.sd?team_id=308&comp_id=1  Team=Blackburn 
URL=http://www.soccerbase.com/teams/team.sd?team_id=354&comp_id=1  Team=Bolton 
URL=http://www.soccerbase.com/teams/team.sd?team_id=536&comp_id=1  Team=Chelsea 
URL=http://www.soccerbase.com/teams/team.sd?team_id=942&comp_id=1  Team=Everton 
URL=http://www.soccerbase.com/teams/team.sd?team_id=1055&comp_id=1  Team=Fulham 
URL=http://www.soccerbase.com/teams/team.sd?team_id=1563&comp_id=1  Team=Liverpool 
URL=http://www.soccerbase.com/teams/team.sd?team_id=1718&comp_id=1  Team=Man City 
URL=http://www.soccerbase.com/teams/team.sd?team_id=1724&comp_id=1  Team=Man Utd 
URL=http://www.soccerbase.com/teams/team.sd?team_id=1823&comp_id=1  Team=Newcastle 
URL=http://www.soccerbase.com/teams/team.sd?team_id=1855&comp_id=1  Team=Norwich 
URL=http://www.soccerbase.com/teams/team.sd?team_id=2093&comp_id=1  Team=QPR 
URL=http://www.soccerbase.com/teams/team.sd?team_id=2477&comp_id=1  Team=Stoke 
URL=http://www.soccerbase.com/teams/team.sd?team_id=2493&comp_id=1  Team=Sunderland 
URL=http://www.soccerbase.com/teams/team.sd?team_id=2513&comp_id=1  Team=Swansea 
URL=http://www.soccerbase.com/teams/team.sd?team_id=2590&comp_id=1  Team=Tottenham 
URL=http://www.soccerbase.com/teams/team.sd?team_id=2744&comp_id=1  Team=West Brom 
URL=http://www.soccerbase.com/teams/team.sd?team_id=2783&comp_id=1  Team=Wigan 
URL=http://www.soccerbase.com/teams/team.sd?team_id=2848&comp_id=1  Team=Wolves 
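
Since the question mentions HTML::TokeParser, here is a rough sketch of the same extraction done with that module instead. It runs against an inline sample rather than the live site. The sample markup is an assumption modelled on the output above, not copied from the real page, and `extract_links` is a hypothetical helper name.

```perl
use strict;
use warnings;
use HTML::TokeParser ();
use URI ();

# Return [absolute URL, link text] pairs for every <a> whose href matches $regex.
sub extract_links {
    my ($html, $base, $regex) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @links;
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href};          # attribute hash is element 1 of the tag token
        next unless defined $href && $href =~ $regex;
        my $text = $parser->get_trimmed_text('/a');
        push @links, [ URI->new_abs($href, $base)->as_string, $text ];
    }
    return @links;
}

# Assumed sample markup (hypothetical, for illustration only).
my $sample = <<'HTML';
<ul>
  <li><a href="/teams/team.sd?team_id=142&amp;comp_id=1">Arsenal</a></li>
  <li><a href="/teams/team.sd?team_id=154&amp;comp_id=1">Aston Villa</a></li>
  <li><a href="/about.sd">About</a></li>
</ul>
HTML

my $base = 'http://www.soccerbase.com/teams/home.sd';
for my $link (extract_links($sample, $base, qr/comp_id=1\b/)) {
    printf "URL=%s\tTeam=%s\n", @$link;
}
```

For a real run you would pass the fetched page content (for example `$w->content`) instead of `$sample`. The find_all_links approach above is shorter, so this is mainly useful if you need finer control over the token stream.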

Once again you have solved my problem, thank you so much! – blacky 2012-02-22 01:02:25


OK, I have run into a small problem. When I try to run it, I get this error: Can't locate object method "find_all_links" via package "HTTP::Headers" at /usr/local/ActivePerl-5.14/lib/HTTP/Message.pm line 649. Any idea why? – blacky 2012-02-22 13:18:02


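All I need from this is the URLs in an array of strings, but I can't seem to get it to work. I also get this error: Can't locate object method "url_abs" via package "WWW::Mechanize" at solution.pl line 11. – blacky 2012-02-22 14:31:55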
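
For the array-of-strings request, a minimal sketch follows. It assumes the same page and regex as the answer; `link_urls` and `fetch_urls` are hypothetical helper names. Both error messages in the comments look consistent with calling a method on the wrong object: find_all_links belongs to the WWW::Mechanize object (not to the response returned by get), and url_abs belongs to each WWW::Mechanize::Link (not to the WWW::Mechanize object). That is a guess, since the failing script isn't shown.

```perl
use strict;
use warnings;
use WWW::Mechanize ();

# Convert WWW::Mechanize::Link objects into absolute-URL strings.
sub link_urls {
    return map { $_->url_abs->as_string } @_;
}

# Fetch $page and return the URLs of all links matching $regex as strings.
sub fetch_urls {
    my ($page, $regex) = @_;
    my $mech = WWW::Mechanize->new;
    $mech->get($page);    # returns an HTTP::Response; the links come from $mech itself
    return link_urls($mech->find_all_links(url_regex => $regex));
}

# Runs only when a page URL is given on the command line, e.g.
#   perl solution.pl http://www.soccerbase.com/teams/home.sd
if (@ARGV) {
    my @urls = fetch_urls($ARGV[0], qr/comp_id=1\b/);
    print "$_\n" for @urls;
}
```

After this, `@urls` holds plain strings that can be passed to other scripts or looped over with further `$mech->get($url)` calls.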