Ruby, Mongodb, Anemone: web crawler with a possible memory leak?

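I recently started learning about web crawling and built a sample crawler in Ruby, using Anemone with Mongodb for storage. I'm testing the crawler on a large public website that may have billions of links.

crawler.rb is indexing the correct information, but when I check memory usage in Activity Monitor it shows memory growing constantly. I have only run the crawler for about 6-7 hours, and memory shows 1.38GB for Mongo and 1.37GB for the Ruby process. It seems to grow by roughly 100MB every hour.

It looks like I might have a memory leak? Is there a more optimal way to achieve the same crawl without memory usage escalating out of control, so that it can run longer?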

# Sample web_crawler.rb with Anemone, Mongodb and Ruby.

require 'anemone'

# do not store the page's body.
module Anemone
  class Page
    def to_hash
      {'url' => @url.to_s,
       'links' => links.map(&:to_s),
       'code' => @code,
       'visited' => @visited,
       'depth' => @depth,
       'referer' => @referer.to_s,
       'fetched' => @fetched}
    end

    def self.from_hash(hash)
      page = self.new(URI(hash['url']))
      {'@links' => hash['links'].map { |link| URI(link) },
       '@code' => hash['code'].to_i,
       '@visited' => hash['visited'],
       '@depth' => hash['depth'].to_i,
       '@referer' => hash['referer'],
       '@fetched' => hash['fetched']
      }.each do |var, value|
        page.instance_variable_set(var, value)
      end
      page
    end
  end
end


Anemone.crawl("http://www.example.com/", :discard_page_bodies => true, :threads => 1, :obey_robots_txt => true, :user_agent => "Example - Web Crawler", :large_scale_crawl => true) do |anemone|
  anemone.storage = Anemone::Storage.MongoDB

  # only crawl pages that contain /example in url
  anemone.focus_crawl do |page|
    page.links.delete_if do |link|
      (link.to_s =~ /example/).nil?
    end
  end

  # only process pages in the /example directory
  anemone.on_pages_like(/example/) do |page|
    regex = /some type of regex/
    example = page.doc.css('#example_div').inner_html.gsub(regex, '') rescue next

    # Save to text file
    unless example.nil? || example.empty?
      File.open('example.txt', 'a') { |f| f.puts example }
    end
    page.discard_doc!
  end
end


2012-02-22 viotech

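Did you find the cause of the leak? If you think it is a bug in Anemone, did you report it on their [issue tracker](https://github.com/chriskite/anemone/issues)? – 2012-06-21 17:43:45

Related issues mentioned on the Anemone issue tracker include: [memory leak?](https://github.com/chriskite/anemone/issues/49), [memory leak or inefficient memory handling](https://github.com/chriskite/anemone/issues/29) and [Fix OutOfMemory error for large sites](https://github.com/chriskite/anemone/pull/30) – 2012-06-21 17:44:14

I reported it around the same time I posted here on SO. I was able to crawl what I needed for my task by adding the suggested patches, and they let my crawl run much longer. To be honest, memory usage still grew steadily, just not as fast as before. I'm still not sure what causes the memory leak. – viotech 2012-06-22 02:40:59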
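One way to narrow down where the growth comes from is to compare Ruby's own object counts at two points in the crawl. The following is only a rough sketch using the standard `GC.stat` and `ObjectSpace.count_objects` calls; the helper names `heap_snapshot` and `snapshot_delta` are made up for illustration and are not part of Anemone. It says nothing about Mongo's memory or about the process size reported by the OS.

```ruby
# Hypothetical helpers (not Anemone API): snapshot live Ruby objects so that
# two snapshots can be compared.
def heap_snapshot
  GC.start # collect garbage first so only retained objects are counted
  counts = ObjectSpace.count_objects
  {
    live_slots: GC.stat[:heap_live_slots], # object slots currently in use
    strings:    counts[:T_STRING],         # String objects on the heap
    arrays:     counts[:T_ARRAY]           # Array objects on the heap
  }
end

# Per-key difference between two snapshots.
def snapshot_delta(before, after)
  after.each_with_object({}) { |(key, value), delta| delta[key] = value - before[key] }
end

before = heap_snapshot
# Stand-in for data a crawler keeps hold of, e.g. URLs sitting in a queue.
retained = Array.new(50_000) { |i| "http://www.example.com/page-#{i}" }
after = heap_snapshot

puts "retained #{retained.size} strings"
puts snapshot_delta(before, after).inspect
```

In a real crawl the snapshots could be taken every few thousand pages. If the string and array counts keep climbing, the growth is in retained Ruby objects; if they stay flat while the process still grows, the cause is more likely elsewhere.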

I have this problem too, but I'm using Redis as the datastore.

Here is my crawler:

require "rubygems"
require "anemone"

urls = File.open("urls.csv")
opts = {discard_page_bodies: true, skip_query_strings: true, depth_limit: 2000, read_timeout: 10}

File.open("results.csv", "a") do |result_file|
  while row = urls.gets
    row_ = row.strip.split(',')
    # prepend a scheme when the CSV entry does not have one
    url = row_[1].start_with?("http://") ? row_[1] : "http://#{row_[1]}"

    Anemone.crawl(url, opts) do |anemone|
      anemone.storage = Anemone::Storage.Redis
      puts "crawling #{url}"
      anemone.on_every_page do |page|
        next if page.body.nil?

        if page.body.downcase.include?("sometext")
          puts "found one at #{url}"
          result_file.puts "#{row_[0]},#{row_[1]}"
        end # end if
      end # end on_every_page
    end # end crawl
  end # end while

  # we're done
  puts "We're done."
end # end File.open

I applied the patch from here to the core.rb file in my Anemone gem:

35  # Prevent page_queue from using excessive RAM. Can indirectly limit rate of crawling. You'll additionally want to use discard_page_bodies and/or a non-memory 'storage' option
36  :max_page_queue_size => 100,

...

(The following used to be on line 155)

157  page_queue = SizedQueue.new(@opts[:max_page_queue_size])
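To illustrate why this bounds memory: a `SizedQueue` makes the producing thread block once the queue holds the maximum number of items, so fetched pages cannot pile up faster than they are processed. Below is a minimal sketch with Ruby's standard `SizedQueue`. The method name `run_bounded_pipeline`, the limit of 3 and the fake page names are assumptions for illustration only, not Anemone code.

```ruby
require 'thread' # SizedQueue is built in on modern Rubies; this is a harmless no-op there

# Illustrative only: one producer "fetches" pages, the caller "processes" them.
# Returns [number of pages processed, largest queue size observed].
def run_bounded_pipeline(max_queue_size, total_pages)
  queue = SizedQueue.new(max_queue_size)
  peak = 0

  producer = Thread.new do
    total_pages.times do |i|
      queue.push("page-#{i}")        # blocks while the queue is already full
      peak = [peak, queue.size].max  # only this thread writes peak
    end
    queue.push(:done)                # sentinel telling the consumer to stop
  end

  processed = 0
  processed += 1 while queue.pop != :done
  producer.join
  [processed, peak]
end

processed, peak = run_bounded_pipeline(3, 20)
puts "processed #{processed} pages, queue never held more than #{peak}"
```

With a plain `Queue` the producer would never block, so a fast fetcher paired with a slow page handler could grow the queue without limit. The trade-off, as the patch comment notes, is that the bound can slow the crawl.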

and I have an hourly cron job doing:

#!/usr/bin/env python 
import redis 
r = redis.Redis() 
r.flushall()

to try to keep Redis's memory usage down. I'm restarting a huge crawl now, so we'll see how it goes!

I'll report back with the results...


2012-04-27 18:58:39 Andbdrew

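I'm doing something similar, and I think maybe you're just creating a lot of data.

You're not saving the page body, so that should help with the memory requirements.

The only other improvement I can think of is to use Redis instead of Mongo, since I found it more scalable for Anemone's storage.

Check the size of your data in Mongo. I found I was saving a huge number of rows.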


2012-03-26 13:31:42
