How can I access the content of a JavaScript-driven web page with Perl?

I'm trying to write a small Perl application that fetches League of Legends summoner names from LolKing.

The page's HTML has lines like

<tr data-summonername="MatLife TriHard" class="lb_row_rank_4">

so I just have something like

use strict;
use warnings;

use LWP::Simple;
use HTML::Parser;

my $find_links = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'tr' and exists $attr->{'data-summonername'}) {
                print "$attr->{'data-summonername'}\n";
            }
        },
        "tag, attr"
    ]
);

my $html = get('http://www.lolking.net/leaderboards/#/na/1') or die 'nope';

$find_links->parse($html);

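But this gives me nothing. Even if I test for the class attribute instead, it gives me nothing. For some reason I can't get at the class of the tr elements.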

Using $attr->{data-summonername} without the single quotes gives me errors, because of the hyphen I think. If I use $attr->{href} it works just fine.
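The hyphen is indeed the cause: Perl only auto-quotes a bareword hash key when it looks like a plain identifier, so data-summonername is parsed as a subtraction of two barewords. A minimal sketch, assuming a hand-built hash shaped like the one HTML::Parser passes to the handler (the href value is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical attribute hash, shaped like the one a start handler receives.
my $attr = {
    'data-summonername' => 'MatLife TriHard',
    class               => 'lb_row_rank_4',
    href                => '/example',          # made-up value
};

# Identifier-like keys are auto-quoted, so these two work without quotes.
print "$attr->{class}\n";
print "$attr->{href}\n";

# A key containing a hyphen must be quoted explicitly.
print "$attr->{'data-summonername'}\n";

# The unquoted form is compiled in a string eval so the compile-time
# failure under "use strict" can be caught and shown instead of aborting.
my $value = eval q{ $attr->{data-summonername} };
my $error = $@;
print "unquoted key failed: $error" unless defined $value;
```

Without strict the unquoted form would compile, but it would evaluate "data" minus "summonername" numerically and look up the key 0, which is not what was intended either.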

Can someone help me out?

Source

2015-03-19 TheOne

Shameless plug: On Windows, you can [get the contents of the web page using Internet Explorer](http://perltricks.com/article/139/2014/12/11/Automated-Internet-Explorer-screenshots-using-Win32-OLE), and then use [HTML::TableExtract](http://www.nu42.com/2012/04/htmltableextract-is-beautiful.html) to extract the information you need. If you are not on Windows, [get the page content via Firefox](http://perltricks.com/article/138/2014/12/8/Controlling-Firefox-from-Perl), and then use HTML::TableExtract. Of course, there is also [PhantomJS](http://phantomjs.org/). – 2015-03-19 12:02:20

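The problem is that most of that page's HTML is built by your browser using JavaScript after the page has finished downloading. Using LWP::Simple::get retrieves only the skeleton HTML and the JavaScript code. You can see that if you print $html instead of parsing it.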
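A quick way to confirm this is to count how often the attribute appears in whatever was downloaded. The sketch below does not fetch anything: both HTML strings are invented stand-ins for the skeleton the server might send and the DOM the browser might hold after the scripts have run, and count_attr is a hypothetical helper name.

```perl
use strict;
use warnings;

# Count how many times an attribute name appears in a chunk of HTML.
# This is a rough textual check for diagnosis, not a real HTML parse.
sub count_attr {
    my ($html, $name) = @_;
    my $count = () = $html =~ /\b\Q$name\E\s*=/g;
    return $count;
}

# Hypothetical stand-ins (not the real LolKing markup).
my $skeleton = '<table id="leaderboard"></table><script src="app.js"></script>';
my $rendered = '<table id="leaderboard">'
             . '<tr data-summonername="Doublelift" class="lb_row_rank_1"></tr>'
             . '<tr data-summonername="F5 Veritas" class="lb_row_rank_2"></tr>'
             . '</table>';

printf "skeleton: %d matches, rendered: %d matches\n",
    count_attr($skeleton, 'data-summonername'),
    count_attr($rendered, 'data-summonername');
```

Running the same check on the string returned by get would show whether the rows are present in the raw response at all. If the count is zero, no change to the parser can help.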

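The usual solution is to use WWW::Mechanize::Firefox, which gets an installed copy of Firefox to download and build the page so that you can then query it. It is much more involved than a simple get, though: you have to install Firefox if you don't already have it, as well as the Mozilla MozRepl plugin that enables remote control. Even then you may run into trouble if you try to access the page content before the browser has finished building it, so this is not for the faint of heart.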
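One common way to cope with a page that is still being built is to poll until the content you need shows up or a timeout expires. The helper below is a generic sketch using only core Perl; wait_until is a hypothetical name, and the check used here is a counter standing in for a real browser query.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Call $check repeatedly until it returns a true value or $timeout seconds
# have passed. Returns the true value, or undef on timeout.
sub wait_until {
    my ($check, $timeout, $interval) = @_;
    my $deadline = time() + $timeout;
    while (1) {
        my $result = $check->();
        return $result if $result;
        return undef if time() >= $deadline;
        sleep($interval);
    }
}

# Stand-in check that succeeds on the third call. With a browser object the
# check might instead be something like (an assumption, not tested here):
#   sub { $mech->content =~ /data-summonername/ }
my $calls = 0;
my $ready = wait_until(sub { ++$calls >= 3 ? $calls : 0 }, 2, 0.05);

print defined $ready ? "ready after $calls checks\n" : "timed out\n";
```

Polling for a specific marker, rather than sleeping for a fixed time, keeps the script fast when the page loads quickly and still bounded when it never finishes.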

Update

For your benefit, here is a solution using WWW::Mechanize::Firefox.

use strict;
use warnings;

use WWW::Mechanize::Firefox;
use HTML::TreeBuilder::XPath;

my $url = 'http://www.lolking.net/leaderboards/#/na/1';

# Drive an installed Firefox (with MozRepl enabled) so the page's JavaScript runs
my $mech = WWW::Mechanize::Firefox->new;
my $resp = $mech->get($url);
die $resp->status_line unless $resp->is_success;

# Parse the HTML into a tree that can be queried with XPath
my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->content);

# Each leaderboard row has a class like "lb_row_rank_4"; the digits are the rank
for my $node ($tree->findnodes('//tr[starts-with(@class, "lb_row_rank")]')) {
    printf "Rank %2d: %s\n",
        $node->attr('class') =~ /(\d+)/,
        $node->attr('data-summonername');
}

Output

Rank  1: Doublelift
Rank  2: F5 Veritas
Rank  3: Life Love Live
Rank  4: MatLife TriHard
Rank  5: TDK Kyle
Rank  6: Liquid FeniX
Rank  7: Liquid Inori TV
Rank  8: dawoofsclaw
Rank  9: who is he
Rank 10: Ohhhq

Source

2015-03-19 11:39:20 Borodin
