Improving LWP::Simple Perl performance

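I have been tasked with reading a web page and extracting the links from it (easy stuff with HTML::TokeParser). He (my boss) then insists that I follow each of those links, grab some details from each page, and parse all of that information into an XML file that can be read later.

So, I can set this up fairly simply like so: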

#!/usr/bin/perl -w

use strict;
use LWP::Simple;
require HTML::TokeParser;

$| = 1;    # unbuffer output

my $base = 'http://www.something_interesting/';
my $path = 'http://www.something_interesting/Default.aspx';
my $rawHTML = get($path);    # attempt to download the page into memory
die "Can't fetch $path" unless defined $rawHTML;

my $p = HTML::TokeParser->new(\$rawHTML) || die "Can't open: $!";

open(my $out, '>', 'output.xml') or die "Can't write output.xml: $!";

while (my $token = $p->get_tag("a")) {

    my $url = $token->[1]{href} || "-";

    if ($url =~ /event\.aspx\?eventid=(\d+)/) {
        (my $event_id = $url) =~ s/event\.aspx\?eventid=(\d+)/$1/;
        my $text = $p->get_trimmed_text("/a");
        print $out $event_id, "\n";
        print $out $text, "\n";

        my $details  = $base . $url;
        my $contents = get($details);    # one blocking download per link

        # now set up another HTML::TokeParser, and parse each of those files.

    }
}

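This would probably be fine if there were maybe 5 links on the page. However, I am trying to read from ~600 links and grab information from each of those pages. So, needless to say, my method takes a long time... I honestly don't know how long, because I have never let it finish.

My idea was to simply write something that fetches the information only when it is needed (for example, a Java app that looks up the information for the link you want)... however, that doesn't seem to be acceptable, so I'm turning to you guys :)

Is there any way to improve this process?

Source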

2011-06-24 Aelfhere

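You will probably see a speed boost, at the expense of less simple code, if you execute your get()s in parallel instead of sequentially.

Parallel::ForkManager is where I would start (its documentation even includes an LWP::Simple get() example), but there are plenty of other alternatives to be found on CPAN, including the fairly dated LWP::Parallel::UserAgent.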
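As a rough sketch of the parallel approach, assuming a Parallel::ForkManager release that can pass data back to the parent through finish(): the helper names event_links and fetch_all, the example.invalid host, and the limit of 10 processes are illustrative assumptions, not part of the question's code.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use Parallel::ForkManager;

# Hypothetical helper: keep only the event links and build absolute URLs,
# using the same pattern as the question's code.
sub event_links {
    my ($base, @hrefs) = @_;
    my @links;
    for my $href (@hrefs) {
        next unless $href =~ /event\.aspx\?eventid=(\d+)/;
        push @links, { id => $1, url => $base . $href };
    }
    return @links;
}

# Hypothetical helper: download every link, at most $max_procs at a time.
# Returns a hash ref of event id => HTML (undef if the download failed).
sub fetch_all {
    my ($max_procs, @links) = @_;
    my %page;
    my $pm = Parallel::ForkManager->new($max_procs);

    # Runs in the parent whenever a child exits and receives its data.
    $pm->run_on_finish(sub {
        my ($pid, $exit_code, $ident, $signal, $core_dump, $data) = @_;
        $page{$ident} = defined $data ? $$data : undef;
    });

    for my $link (@links) {
        $pm->start($link->{id}) and next;    # parent: go on to the next link
        my $html = get($link->{url});        # child: blocking download
        $pm->finish(0, \$html);              # child: hand the page to the parent
    }
    $pm->wait_all_children;
    return \%page;
}

# Only runs when a base URL and some hrefs are given on the command line.
if (@ARGV) {
    my ($base, @hrefs) = @ARGV;
    my $pages = fetch_all(10, event_links($base, @hrefs));
    for my $id (sort { $a <=> $b } keys %$pages) {
        printf "%s: %s\n", $id,
            defined $pages->{$id} ? length($pages->{$id}) . " bytes" : "failed";
    }
}
```

Each child is a separate process, so the parsing of each page could also be moved into the child and only the extracted details sent back through finish().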

Source

2011-06-25 02:48:02 Kanji

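This is exactly what I was looking for, thank you. The other answers were useful too. Thanks for the help, everyone :) – Aelfhere

@Aelfhere, I was going to post a solution to your ForkManager question, but it was then deleted. – ikegami

WWW::Mechanize is a great piece of work to start with, and if you are looking at modules, I would also suggest Web::Scraper.

Both have documentation at the links I provided, which should help you get going quickly.

Source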

2011-06-24 22:22:08 mrk

Your HTTP get request is most likely blocking while it waits for the response from the network. Use an asynchronous HTTP library and see if it helps.
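The answer does not say which library it had in mind; AnyEvent::HTTP from CPAN is one possible choice, used here purely as an assumption. In this minimal sketch, fetch_async is a hypothetical helper that starts all requests without waiting and then blocks once until every callback has fired.

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;

# Hypothetical helper: issue all GET requests at once and collect the bodies.
# Returns a hash ref of URL => body (undef for non-2xx responses).
sub fetch_async {
    my (@urls) = @_;
    my %body;
    my $cv = AnyEvent->condvar;

    $cv->begin;    # guard so recv() also returns when @urls is empty
    for my $url (@urls) {
        $cv->begin;
        http_get $url, sub {
            my ($data, $headers) = @_;
            $body{$url} = $headers->{Status} =~ /^2/ ? $data : undef;
            $cv->end;    # one request finished
        };
    }
    $cv->end;
    $cv->recv;     # run the event loop until every request has finished
    return \%body;
}

# Only runs when URLs are given on the command line.
if (@ARGV) {
    my $pages = fetch_async(@ARGV);
    for my $url (sort keys %$pages) {
        printf "%s: %s\n", $url,
            defined $pages->{$url} ? length($pages->{$url}) . " bytes" : "failed";
    }
}
```

AnyEvent::HTTP limits the number of simultaneous connections per host by default, so 600 URLs on one server are queued rather than all opened at once.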

Source

2011-06-25 02:22:36 Oesor

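If you want to fetch more than one item from a server and do so quickly, use TCP Keep-Alive. Drop the simplistic LWP::Simple and use the regular LWP::UserAgent with the keep_alive option. This sets up a connection cache, so you do not incur the TCP connection setup overhead when fetching more pages from the same host.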

use strict; 
use warnings; 
use LWP::UserAgent; 
use HTTP::Request::Common; 

my @urls = @ARGV or die 'URLs!'; 
my %opts = (keep_alive => 10); # cache 10 connections 
my $ua = LWP::UserAgent->new(%opts); 
for (@urls) { 
     my $req = HEAD $_; 
     print $req->as_string; 
     my $rsp = $ua->request($req); 
     print $rsp->as_string; 
} 

my $cache = $ua->conn_cache; 
my @conns = $cache->get_connections; 
# has methods of Net::HTTP, IO::Socket::INET, IO::Socket

Source

2011-06-25 09:45:12 Lumi

use strict; 
use warnings; 

use threads; # or: use forks; 

use Thread::Queue qw(); 

use constant MAX_WORKERS => 10; 

my $request_q = Thread::Queue->new(); 
my $response_q = Thread::Queue->new(); 

# Create the workers. 
my @workers; 
for (1..MAX_WORKERS) {
    push @workers, async {
        while (my $url = $request_q->dequeue()) {
            $response_q->enqueue(process_request($url));
        }
    };
}

# Submit work to workers.
# (@urls, process_request() and process_response() must be defined elsewhere.)
$request_q->enqueue(@urls);

# Signal the workers they are done.  
for (1..@workers) {
    $request_q->enqueue(undef); 
} 

# Wait for the workers to finish. 
$_->join() for @workers; 

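# Collect the results. All workers have been joined, so nothing more will
# be enqueued; dequeue_nb() returns undef once the queue is empty, whereas
# dequeue() would block forever at that point.
while (defined(my $item = $response_q->dequeue_nb())) {
    process_response($item);
}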

Source

2011-06-27 20:48:33 ikegami

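Your problem is that scraping is more CPU-intensive than I/O-intensive. While most people here would suggest that you use more CPU, I will try to show one great advantage of Perl being used as a "glue" language. Everyone agrees that libxml2 is an excellent XML/HTML parser. Likewise, libcurl is an awesome download agent. However, in the Perl world, many scrapers are based on LWP::UserAgent and HTML::TreeBuilder::XPath (which is similar to HTML::TokeParser, while being XPath-compliant). In that case, you can use drop-in replacement modules to handle the downloads and the HTML parsing via libcurl/libxml2: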

use LWP::Protocol::Net::Curl; 
use HTML::TreeBuilder::LibXML; 
HTML::TreeBuilder::LibXML->replace_original();

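I saw an average 5x speed increase just by prepending these 3 lines to several scrapers I used to maintain. But, as you are using HTML::TokeParser, I would recommend trying Web::Scraper::LibXML instead, together with LWP::Protocol::Net::Curl, which affects both LWP::Simple and Web::Scraper.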
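A minimal sketch of what the link extraction from the question might look like with Web::Scraper::LibXML, which exports the same scraper/process DSL as Web::Scraper. The XPath expression, the extract_events helper, the result keys, and the example.invalid URL are assumptions for illustration only.

```perl
use strict;
use warnings;
use Web::Scraper::LibXML;    # same DSL as Web::Scraper, parsing via libxml2

# Collect the URL and link text of every event link on a page.
my $event_links = scraper {
    process '//a[contains(@href, "event.aspx?eventid=")]',
        'events[]' => { url => '@href', title => 'TEXT' };
};

# Hypothetical helper: returns a list of { url => ..., title => ... } hashes.
# $base is used to resolve relative hrefs into absolute URLs.
sub extract_events {
    my ($html, $base) = @_;
    my $result = $event_links->scrape($html, $base);
    return @{ $result->{events} || [] };
}

# Only runs when an HTML file and a base URL are given on the command line.
if (@ARGV == 2) {
    my ($file, $base) = @ARGV;
    open(my $in, '<', $file) or die "Can't read $file: $!";
    my $html = do { local $/; <$in> };    # slurp the whole file
    close $in;
    print "$_->{url}\t$_->{title}\n" for extract_events($html, $base);
}
```

Adding `use LWP::Protocol::Net::Curl;` at the top of the script would additionally route the LWP downloads through libcurl, as described above.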

Source

2012-11-26 03:26:27 creaktive
