
I'm building a simple web spider with Sidekiq and Mechanize. The Mechanize instance appears to be getting overwritten.

When I run this for a single domain it works fine; when I run it for multiple domains it fails. I believe the cause is that something is being overwritten when another Sidekiq worker is instantiated, but I'm not sure whether that's actually true or how to fix it.

# My scrape_search controller's create action searches on Google.
def create
  @scrape = ScrapeSearch.build(keywords: params[:keywords], profession: params[:profession])
  agent = Mechanize.new
  scrape_search = agent.get('http://google.com/') do |page|
    search_result = page.form...
    search_result.css("h3.r").map do |link|
      result = link.at_css('a')['href'] # Narrowing down to real search results
      @domain = Domain.new(some params)
      ScrapeDomainWorker.perform_async(@domain.url, @domain.id, remaining_keywords)
    end
  end
end

I create one Sidekiq job per domain. Most of the domains I'm after should contain only a handful of pages, so there is no need for per-page sub-jobs.

Here is my worker:

class ScrapeDomainWorker
  include Sidekiq::Worker
  ...

  def perform(domain_url, domain_id, keywords)
    @domain      = Domain.find(domain_id)
    @domain_link = @domain.protocol + '://' + domain_url
    @keywords    = keywords

    # First we scrape the homepage and get the first links
    @domain.to_parse = ['/'] # to_parse is an array of PATHS to parse for the domain
    mechanize_path('/')
    @domain.verified << '/' # verified is an Array field containing valid domain paths
    get_paths(@web_page) # Now we should have to_parse populated with homepage links

    @domain.scraped = 1 # Loop counter
    while @domain.scraped < 100
      @domain.to_parse.each do |path|
        @domain.to_parse.delete(path)
        @domain.scraped += 1
        mechanize_path(path) # We create a Nokogiri HTML doc with Mechanize for the valid path
        ...
        get_paths(@web_page) # Fire this to repopulate to_parse !!!
      end
    end
    @domain.save
  end

  def mechanize_path(path)
    agent = Mechanize.new
    begin
      @web_page = agent.get(@domain_link + path)
    rescue Exception => e
      puts "Mechanize Exception for #{path} :: #{e.message}"
    end
  end

  def get_paths(web_page)
    paths = web_page.links.map { |link| link.href.gsub((@domain.protocol + '://' + @domain.url), "") } ## This works when I scrape a single domain, but fails with ".gsub for nil" when I scrape a few domains.
    paths.uniq.each do |path|
      @domain.to_parse << path
    end
  end

end

This works when I scrape a single domain, but fails with ".gsub for nil" on web_page when I scrape several domains.


Welcome to Stack Overflow. Please read "[mcve]". Please reduce the code to the minimum needed to reproduce the problem. –

Answer


You can wrap your code in another class, and then create an object of that class inside your worker:

class ScrapeDomainWrapper
  def initialize(domain_url, domain_id, keywords)
    # ...
  end

  def mechanize_path(path)
    # ...
  end

  def get_paths(web_page)
    # ...
  end
end

And your worker:

class ScrapeDomainWorker
  include Sidekiq::Worker

  def perform(domain_url, domain_id, keywords)
    ScrapeDomainWrapper.new(domain_url, domain_id, keywords)
  end
end

Also, bear in mind that Mechanize::Page#links may be nil.
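A minimal sketch of how that guard could look, reusing get_paths from the question (the early return and the compact call are my additions, not part of the original code):

def get_paths(web_page)
  return if web_page.nil? # mechanize_path rescues errors, so there may be no page

  prefix = @domain.protocol + '://' + @domain.url
  hrefs  = web_page.links.map(&:href).compact # drop links that have no href
  hrefs.uniq.each do |href|
    @domain.to_parse << href.gsub(prefix, "")
  end
end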


I wrapped it as you suggested. I also converted all the instance variables to local variables (@web_page became web_page, etc.). I still get an "undefined method 'gsub' for nil:NilClass" on paths = web_page.links.map { |link| link.href.gsub((@domain.protocol + '://' + @domain.url), "") }. Strangely, it works just fine if I run it on its own. – Ben


If you move the code into another class, you don't need to rename the variables. As long as they are instance variables rather than class variables, everything is fine. Also, I think 'Mechanize::Link#href' may be 'nil' in some cases. You should check for that. – Wikiti
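To illustrate the scoping point with a hypothetical worker (ExampleWorker is not from this thread): Sidekiq creates a fresh worker object for every job, so instance variables are private to each job, while class variables live on the class and are shared by every job running in the same process:

class ExampleWorker
  include Sidekiq::Worker

  @@seen = [] # class variable: one copy per process, shared by all concurrent jobs

  def perform(url)
    @current_url = url # instance variable: private to this job's worker object
    @@seen << url      # concurrent jobs interleave here; this would need a mutex
  end
end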


Yes, I added a failsafe for that. Thanks for your help! – Ben