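How do I "crawl" only the root URL with Anemone?

In the example below, I want Anemone to run only on the root URL (example.com). I'm not sure whether I should use the on_pages_like method and, if so, what pattern I would need.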

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_pages_like(???) do |page|
    # some code to execute
  end
end


2013-01-09 Jackson Henley

require 'anemone' 
Anemone.crawl("http://www.example.com/", :depth_limit => 1) do |anemone| 
    # some code to execute 
end
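To make the effect of a depth limit concrete, here is a minimal sketch that does not use Anemone at all. LINKS and crawl are hypothetical names, and the link graph is made up. The sketch assumes the limit is counted in link hops from the start page, so a limit of 0 stays on the root and a limit of 1 also fetches the pages the root links to. It is worth checking how Anemone itself counts before relying on either value.

```ruby
# Hypothetical in-memory link graph standing in for a real site.
LINKS = {
  "/"            => ["/about", "/blog"],
  "/about"       => ["/team"],
  "/blog"        => ["/blog/post-1"],
  "/team"        => [],
  "/blog/post-1" => []
}.freeze

# Breadth-first walk that stops following links from a page once that
# page's depth reaches depth_limit. A limit of false means "no limit".
def crawl(start, depth_limit)
  visited = { start => 0 }   # url => depth at which it was found
  queue = [start]
  until queue.empty?
    url = queue.shift
    depth = visited[url]
    # 0 is truthy in Ruby, so a limit of 0 is honoured here.
    next if depth_limit && depth >= depth_limit
    LINKS.fetch(url, []).each do |link|
      next if visited.key?(link)
      visited[link] = depth + 1
      queue << link
    end
  end
  visited.keys
end

p crawl("/", 0)      # => ["/"]
p crawl("/", 1)      # => ["/", "/about", "/blog"]
p crawl("/", false)  # => ["/", "/about", "/blog", "/team", "/blog/post-1"]
```

If the real crawler behaves like this sketch, a limit of 1 would still visit every page linked from the root, so it is worth printing the visited URLs once to confirm which value gives a root-only crawl.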

You can also pass any of the options below in the options hash; the values shown are the defaults:

# run 4 Tentacle threads to fetch pages 
:threads => 4, 
# disable verbose output 
:verbose => false, 
# don't throw away the page response body after scanning it for links 
:discard_page_bodies => false, 
# identify self as Anemone/VERSION 
:user_agent => "Anemone/#{Anemone::VERSION}", 
# no delay between requests 
:delay => 0, 
# don't obey the robots exclusion protocol 
:obey_robots_txt => false, 
# by default, don't limit the depth of the crawl 
:depth_limit => false, 
# number of times HTTP redirects will be followed 
:redirect_limit => 5, 
# storage engine defaults to Hash in +process_options+ if none specified 
:storage => nil, 
# Hash of cookie name => value to send with HTTP requests 
:cookies => nil, 
# accept cookies from the server and send them back? 
:accept_cookies => false, 
# skip any link with a query string? e.g. http://foo.com/?u=user 
:skip_query_strings => false, 
# proxy server hostname 
:proxy_host => nil, 
# proxy server port number 
:proxy_port => false, 
# HTTP read timeout in seconds 
:read_timeout => nil
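On the on_pages_like route raised in the question: as far as I know, that method takes one or more regular expressions and runs the block only for pages whose URL matches. A pattern anchored at both ends could single out the root. The sketch below only exercises the regular expression against sample strings; ROOT_ONLY and the URL list are illustrative assumptions, not part of Anemone.

```ruby
# Hypothetical pattern: matches the site root with or without a trailing
# slash, and nothing deeper. \A and \z anchor the whole string.
ROOT_ONLY = %r{\Ahttps?://www\.example\.com/?\z}

urls = [
  "http://www.example.com/",
  "http://www.example.com",
  "http://www.example.com/about",
  "http://www.example.com/?page=2"
]

# Keep only the URLs the pattern accepts.
root_matches = urls.select { |u| u.match?(ROOT_ONLY) }

p root_matches  # => ["http://www.example.com/", "http://www.example.com"]
```

Passing such a pattern as anemone.on_pages_like(ROOT_ONLY) would limit where the block runs, but the crawler would presumably still fetch the other pages, so a depth limit is still needed if the goal is to avoid those requests.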

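In my personal experience, Anemone is not very fast and has a lot of corner cases. The documentation is lacking (as you have experienced), and the author does not seem to be maintaining the project. YMMV. I briefly tried Nutch and did not play with it as much, but it seemed faster. No benchmarks, sorry.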


2013-01-09 03:28:40 sunnyrjuneja

Ty Sunny. I'm afraid you're right; what you describe matches my experience with Anemone.
