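Can Mechanize make this easier?

After this, I learned that, for that particular example, Mechanize in Ruby/Perl is easier to use than HTML::TreeBuilder 3.

Is Mechanize superior to HTML::TokeParser?

Would the code below be easier to write in Ruby using Mechanize?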

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTML::TokeParser;

sub get_img_page_urls {
    my $url = shift;

    my $ua = LWP::UserAgent->new;
    $ua->agent("Mozilla/8.0");

    my $req = HTTP::Request->new(GET => $url);
    $req->header('Accept' => 'text/html');

    my $response_u = $ua->request($req); # send request

    die "Error: ", $response_u->status_line unless $response_u->is_success;

    my $stream = HTML::TokeParser->new(\$response_u->content);

    my %urls = ();

    my $found_thumbnails = 0;
    my $found_thumb = 0;

    # Start-tag tokens look like ['S', $tag, \%attr, ...]
    while (my $token = $stream->get_token) {

        # <div class="thumb-box" ... >
        if ($token->[0] eq 'S' and $token->[1] eq 'div' and ($token->[2]{class} // '') eq 'thumb-box') {
            $found_thumbnails = 1;
        }

        # <div class="thumb" ... >
        if ($token->[0] eq 'S' and $token->[1] eq 'div' and ($token->[2]{class} // '') eq 'thumb') {
            $found_thumb = 1;
        }

        # <a ... >
        if ($found_thumbnails and $found_thumb and $token->[0] eq 'S' and $token->[1] eq 'a') {
            $urls{'http://example.com' . $token->[2]{href}} = 1;

            # One URL has been found. Now start all over.
            $found_thumb = 0;
            $found_thumbnails = 0;
        }

    }

    return %urls;
}

2011-12-10 Sandra Schlichting

The first thing I noticed is that you should get into the habit of using true/false rather than 1/0, because in Ruby 0 evaluates as true. – pguardiario
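To illustrate the truthiness point, here is a minimal Ruby sketch. The helper name `truthy?` and the flag variables are made up for this illustration and do not come from the question. It only shows that a Perl-style 0 flag would not behave as "false" in a Ruby condition.

```ruby
# In Ruby only nil and false are falsy; every other value, including 0, is truthy.
# truthy? is a hypothetical helper that reports how a value behaves in a condition.
def truthy?(value)
  value ? true : false
end

found_with_integer = 0      # a Perl-style "false" flag
found_with_boolean = false  # the idiomatic Ruby flag

puts truthy?(found_with_integer)  # prints "true"
puts truthy?(found_with_boolean)  # prints "false"
```

So a direct port of the `$found_thumb = 0` flags would always take the "found" branch unless the flags are real booleans.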

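This is almost exactly the same as your other question. Mechanize is not a parser, so you can't compare it to TokeParser. (But IMHO any modern DOM parser will be better than TokeParser.) And yes, the code is easier to write in Ruby, with or without Mechanize. (This code could also be much simpler in Perl, BTW.) –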

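Mechanize is more than a parser. It adds a simulated browser that lets you navigate a site, fill in forms, and so on, but it also includes a parser that makes web scraping very simple. Here is your method rewritten using Ruby Mechanize: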

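require 'mechanize'

def get_img_page_urls(url)
    agent = Mechanize.new
    agent.user_agent_alias = "Windows Mozilla"
    # The XPath selects the href attribute nodes; map them to plain strings.
    agent.get(url).search("//div[@class='thumb-box']/div[@class='thumb']/a/@href").map(&:value)
end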

2011-12-11 01:21:03

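I'm not sure you need Mechanize at all, since I think Nokogiri is enough. I don't know Perl, so I'm not entirely sure how the HTML in the example is laid out, but I assume it looks like this: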

<div class="thumb-box"> 
    ... 
    <div class="thumb"> 
    ... 
    <a href="http://example.com/img/5.jpg">... 
    </div> 
</div>

Here is the code using Nokogiri:

require 'nokogiri'
require 'open-uri'

def get_img_page_urls(url)
    urls = []
    # Fetch the URL that was passed in, sending a custom User-Agent header.
    doc = Nokogiri::HTML(URI.open(url, 'User-Agent' => 'Mozilla/8.0'))
    doc.css('div.thumb-box div.thumb a').each do |link|
        urls << link.attr("href")
    end

    urls
end

2011-12-11 00:19:36

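+1 for Ruby + Nokogiri. This can be written more briefly: Nokogiri::HTML(URI.open('http://www.example.com')).css('div.thumb-box div.thumb a').map { |a| a['href'] } –

Could you add how to set a custom user agent string, so that the program behaves more like the original code? – daxim

You can use link[:href] or link['href'] instead of link.attr("href"). – pguardiario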
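On matching the original code's behaviour more closely: the Perl version prefixes each href with http://example.com and uses hash keys, so duplicates collapse. A rough sketch of the same idea in Ruby follows. The helper name `absolute_urls` and the sample hrefs are assumptions for illustration only. It uses the standard library's `URI.join`, which resolves a relative href against a base rather than concatenating strings.

```ruby
require 'uri'

# Hypothetical helper: turn scraped href values into absolute, unique URLs.
# nil entries (links without an href) are skipped.
def absolute_urls(hrefs, base)
  hrefs.compact.map { |href| URI.join(base, href).to_s }.uniq
end

# Sample hrefs, invented for this sketch.
hrefs = ['/img/5.html', '/img/6.html', '/img/5.html', 'http://example.com/img/7.html']
puts absolute_urls(hrefs, 'http://example.com')
```

The result of `doc.css(...).map { |a| a['href'] }` could be passed in as `hrefs`. Unlike plain string concatenation, `URI.join` leaves an already absolute href unchanged.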

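Almost anything is better than HTML::TokeParser as far as the interface is concerned. WWW::Mechanize shines with forms, but it too lacks a declarative way to find particular elements. I like Web::Query and HTML::Query; they model their interfaces after jQuery, and from what I hear this style of programming is popular.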

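The program from the question becomes much shorter, as follows. It raises exceptions automatically, so no explicit error handling is needed.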

use URI; 
use Web::Query 'wq'; 

sub get_img_page_urls { 
    my ($url) = @_; 
    $Web::Query::UserAgent = LWP::UserAgent->new(agent => 'Mozilla/8.0'); 

    return map { 
     URI->new($_)->abs('http://example.com')->as_string # hash key 
     => 1             # hash value 
    } wq($url)->find('div.thumb-box div.thumb a')->attr('href'); 
}

Previously posted as a comment: https://stackoverflow.com/q/8274221#comment-10196381

2011-12-11 00:33:57 daxim
