2013-01-09 41 views

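In the example below, I want Anemone to run only on the root URL (example.com). I'm not sure whether I should use the on_pages_like method and, if so, what pattern I would need. How do I "crawl" only the root URL with Anemone?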

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_pages_like(???) do |page|
    # some code to execute
  end
end

Answer

require 'anemone' 
Anemone.crawl("http://www.example.com/", :depth_limit => 1) do |anemone| 
    # some code to execute 
end 
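
For the on_pages_like part of the question, here is a minimal sketch rather than a definitive recipe. It assumes on_pages_like tests each pattern against the page URL as a string, and it uses focus_crawl with an empty array so that no links are followed at all. That second step is there because, if I read Anemone's depth handling correctly, the start page counts as depth 0, so a depth limit of 1 may still fetch pages linked directly from the root. ROOT_PATTERN and crawl_root_only are hypothetical names made up for this example.

```ruby
# Hypothetical pattern: matches only the root URL, with or without
# the trailing slash, and nothing below it.
ROOT_PATTERN = %r{\Ahttp://www\.example\.com/?\z}

# Hypothetical helper: visit the start URL and follow no links.
def crawl_root_only(start_url)
  # Loaded lazily so the pattern can be checked without the gem installed.
  require 'anemone'

  Anemone.crawl(start_url) do |anemone|
    # Returning an empty array tells the crawler to queue no further links.
    anemone.focus_crawl { |_page| [] }

    # Run the block only for pages whose URL matches the root pattern.
    anemone.on_pages_like(ROOT_PATTERN) do |page|
      puts page.url
    end
  end
end

# Example call (needs the anemone gem and network access):
# crawl_root_only("http://www.example.com/")
```

Note that on_pages_like only decides which pages the block runs on; on its own it does not stop the crawler from fetching other pages, which is why the sketch also restricts the links to follow.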

You can also specify any of the following in the options hash; the values shown are the defaults:

# run 4 Tentacle threads to fetch pages 
:threads => 4, 
# disable verbose output 
:verbose => false, 
# don't throw away the page response body after scanning it for links 
:discard_page_bodies => false, 
# identify self as Anemone/VERSION 
:user_agent => "Anemone/#{Anemone::VERSION}", 
# no delay between requests 
:delay => 0, 
# don't obey the robots exclusion protocol 
:obey_robots_txt => false, 
# by default, don't limit the depth of the crawl 
:depth_limit => false, 
# number of times HTTP redirects will be followed 
:redirect_limit => 5, 
# storage engine defaults to Hash in +process_options+ if none specified 
:storage => nil, 
# Hash of cookie name => value to send with HTTP requests 
:cookies => nil, 
# accept cookies from the server and send them back? 
:accept_cookies => false, 
# skip any link with a query string? e.g. http://foo.com/?u=user 
:skip_query_strings => false, 
# proxy server hostname 
:proxy_host => nil, 
# proxy server port number 
:proxy_port => false, 
# HTTP read timeout in seconds 
:read_timeout => nil 
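
As an illustration of how these options are passed, the sketch below overrides a few of the defaults listed above and hands them to Anemone.crawl as its second argument. The chosen values, and the names POLITE_OPTIONS and polite_crawl, are assumptions made up for this example; any key left out should keep its default.

```ruby
# Hypothetical overrides for a slower, more polite crawl.
POLITE_OPTIONS = {
  :threads         => 1,     # a single fetch thread
  :delay           => 1,     # wait one second between requests
  :obey_robots_txt => true,  # respect the robots exclusion protocol
  :verbose         => true   # print progress while crawling
}

# Hypothetical helper: crawl start_url with the overrides above.
def polite_crawl(start_url)
  # Loaded lazily so the options hash can be inspected without the gem.
  require 'anemone'

  Anemone.crawl(start_url, POLITE_OPTIONS) do |anemone|
    anemone.on_every_page { |page| puts page.url }
  end
end

# Example call (needs the anemone gem and network access):
# polite_crawl("http://www.example.com/")
```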

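In my personal experience, Anemone is not very fast and has a lot of corner cases. The documentation is lacking (as you have experienced), and the author does not seem to be maintaining the project. YMMV. I also tried Nutch briefly; I didn't play with it as much, but it seemed faster. No benchmarks, sorry.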


Thanks, 晴朗. I'm afraid you're right; what you describe matches my own experience with Anemone.