Why is Web::Scraper so slow?

I'm using Web::Scraper to pull some data out of a very simple table and convert it to the form I need. I also use WWW::Mechanize to fill in and submit the form, and that part is not slow.

Once I started using Web::Scraper, I found that it takes a long time to get the data back from the page. Profiling shows the following:

6299228 13.7s line XML/XPathEngine/Step.pm 
7335 10.9s line Net/HTTP/Methods.pm 
3990690 10.4s line XML/XPathEngine/NodeSet.pm 
2690467 7.72s line HTML/TreeBuilder/XPath.pm 
2047085 5.70s line XML/XPathEngine/Function.pm 
978212 3.37s line XML/XPathEngine/Literal.pm 
1791592 3.29s line HTML/Element.pm 
661985 3.15s line XML/XPathEngine.pm 
1997421 2.52s line XML/XPathEngine/Expr.pm

Running it from the console gives the following:

real 0m28.042s 
user 0m11.312s 
sys  0m0.121s

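Submitting the same custom query through the form in a web browser (with debugging on) takes only about 3.5 seconds, so I have narrowed the slowdown down to Web::Scraper.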

Here is some of the scraping code:

$offers = scraper {
    # One entry in 'td[]' per row of the outer table
    process 'table > tr' => 'td[]' => scraper {
        process 'td.tdCallNumber > strong'   => 'tdCallNumber'   => 'TEXT';
        process 'td.tdDateReceived > strong' => 'tdDateReceived' => 'TEXT';
        process 'td.tdTimeReceived > strong' => 'tdTimeReceived' => 'TEXT';
        process 'td.tdLocation > strong'     => 'tdLocation'     => 'TEXT';
        process 'td.tdDesc > strong'         => 'tdDesc'         => 'TEXT';
        process 'td > table'                 => 'table'          => 'TEXT';
        # One entry in 'data[]' per row of the nested table
        process 'td > table > tr' => 'data[]' => scraper {
            process 'td.tdUnit' => 'tdUnit' => 'TEXT';
            process 'td.tdDIS'  => 'tdDIS'  => 'TEXT';
            process 'td.tdENR'  => 'tdENR'  => 'TEXT';
            process 'td.tdONS'  => 'tdONS'  => 'TEXT';
            process 'td.tdLEF'  => 'tdLEF'  => 'TEXT';
            process 'td.tdARR'  => 'tdARR'  => 'TEXT';
            process 'td.tdBUS'  => 'tdBUS'  => 'TEXT';
            process 'td.tdREM'  => 'tdREM'  => 'TEXT';
            process 'td.tdCOM'  => 'tdCOM'  => 'TEXT';
        };
    };
};
my $D; 
my $print_header = 1; 

$D = $offers->scrape($text);

...

The rest of it converts the data into HTML-based output (almost the same table layout):

my $r; 
for $r (@{ $D->{td} || [] }) { 
    if ($r->{tdCallNumber}) { 
     if ($print_header) { 
      $npage .= " 

$r->{tdCallNumber}, $r->{tdDateReceived}, $r->{tdTimeReceived}, 
      $r->{tdLocation}, $r->{tdDesc}; 
    } 
    if ($r->{data}) { 
     $npage .= '

Is there anything I can do to improve the speed?

2013-09-27 No Way

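Are you doing anything else in the code? It would probably be best to post some sample code so people can look at it and possibly debug it with you to see where the problem might be. Don't forget to measure the average round-trip time between you and the site, since that can also account for slowness, among other things. – Prix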

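I've added some code, and I'm not doing anything complicated. I use WWW::Mechanize to log in to the site and enter data into the form, then use Web::Scraper to get the data and convert it into a local form. –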

I would check out Selenium for web scraping needs. It's pretty awesome. – qwwqwwq

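You can use Devel::NYTProf to find exactly where the slowness is in your program or in the libraries it uses. Once you see what is slow, you can improve it.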

http://www.slideshare.net/Tim.Bunce/develnytprof-200907

# profile code and write database to ./nytprof.out 
perl -d:NYTProf some_perl.pl 

# convert database into a set of html files, e.g., ./nytprof/index.html 
# and open a web browser on the nytprof/index.html file 
nytprofhtml --open
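
As a coarser complement to the profiler, wall-clock timing around each step can show how much of the run is network wait versus parsing (in the question, `real` is far larger than `user`, which hints at time spent waiting). This is only a sketch using the core Time::HiRes module; the `timed` helper is a hypothetical name, and the two subs are stand-ins for the real WWW::Mechanize fetch and the `$offers->scrape($text)` call.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code reference and return
# (elapsed wall-clock seconds, the code's return value).
sub timed {
    my ($code) = @_;
    my $start  = [gettimeofday];
    my $result = $code->();
    return (tv_interval($start), $result);
}

# Stand-in for the fetch step (assumption: replace with the WWW::Mechanize request).
my ($fetch_secs, $text) = timed(sub {
    return '<table><tr><td>demo</td></tr></table>';
});

# Stand-in for the parse step (assumption: replace with $offers->scrape($text)).
my ($parse_secs, $cells) = timed(sub {
    return [ $text =~ m{<td>(.*?)</td>}g ];
});

printf "fetch: %.3fs  parse: %.3fs\n", $fetch_secs, $parse_secs;
```

If the fetch time dominates, no amount of tuning the scraper will help much; if the parse time dominates, the XPath engine is the place to look.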

2013-09-27 11:11:26 user1126070

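Perhaps you could take a look at HTML::TreeBuilder::LibXML. The module documentation says that HTML::TreeBuilder::XPath can be slow for large documents, and that it implements "enough methods ... so modules like Web::Scraper work". A benchmark on the documentation page shows the libxml variant to be about 1600% faster than the pure-Perl variant.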
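
A minimal sketch of the swap, assuming both Web::Scraper and HTML::TreeBuilder::LibXML are installed from CPAN. The module's synopsis documents `replace_original()` as a way to make `HTML::TreeBuilder::XPath->new` hand back libxml-backed trees; the HTML snippet and the `units` key below are made up for illustration, and whether the speedup carries over to your page is something to measure rather than assume.

```perl
use strict;
use warnings;
use Web::Scraper;
use HTML::TreeBuilder::LibXML;

# From here on, HTML::TreeBuilder::XPath->new returns a libxml-backed tree,
# so Web::Scraper picks up the faster parser without further changes.
HTML::TreeBuilder::LibXML->replace_original();

# Illustrative input only (not the real page from the question).
my $html = <<'HTML';
<html><body><table>
<tr><td class="tdUnit">E12</td><td class="tdDIS">10:01</td></tr>
<tr><td class="tdUnit">M7</td><td class="tdDIS">10:03</td></tr>
</table></body></html>
HTML

# Collect the text of every td.tdUnit cell into an array.
my $units = scraper {
    process 'td.tdUnit', 'units[]' => 'TEXT';
};

my $result = $units->scrape($html);
print "$_\n" for @{ $result->{units} || [] };
```

The existing `scraper { ... }` definitions stay as they are; only the `use` line and the `replace_original()` call are added near the top of the script.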

2013-09-27 13:40:21
