2011-10-01 65 views
2

什麼是Mechanize的最簡單的解決方案,以查看頁面是否已更新?機械化 - 檢查頁面是否已更新的最簡單方法?

我正在考慮創建一個名爲pages的表。

這將有:

pagename - varchar 
page - text 
pageupdated - boolean 

我應該如何創建屏幕刮刀,並保存在數據庫中的數據? 以及如何創建一個方法來比較表中的html和刮取的數據。檢查頁面是否已更新。

回答

1

答案更新和測試。

下面是一個使用頁面模型(使用retryable-rb)的例子:

####### app/models/page.rb 

require 'digest' 
require 'retryable' 

class Page < ActiveRecord::Base 
    include Retryable 

    # Scrape page before validation 
    before_validation :scrape_content, :if => :remote_url? 

    # Will cause save to fail if page could not be retrieved 
    validates_presence_of :page, :if => :remote_url?, :message => "URL provided is invalid or inaccessible." 

    # Update digest if/when all validations have passed 
    before_save :set_digest 

    # ... 

    def update_page! 
    self.scrape_content 
    self.set_digest 
    self.save! 
    end 

    def page_updated? 
    self.page_updated 
    end 

    protected 

    def scrape_content 
    ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X) ' + 
     'AppleWebKit/535.1 (KHTML, like Gecko) ' + 
     'Chrome/14.0.835.186 Safari/535.1' 

    # Using retryable, create scraper and get page 
    scraper = Mechanize.new{ |i| i.user_agent = ua } 
    scraped_page = retryable(:times => 3, :sleep => false) do 
     scraper.get(URI.encode(self.remote_url)) 
    end 
    self.page_updated = false 
    self.page = scraped_page.content 
    self.name ||= scraped_page.title 
    self.digest ||= Digest.hexencode(self.page) 
    end 

    def set_digest 
    # Create new digest of page content 
    new_digest = Digest.hexencode(self.page) 

    # If digest has changed, update digest and set flag 
    if (new_digest != self.digest) && !self.digest.nil? 
     self.digest = new_digest 
     self.page_updated = true 
    else 
     self.page_updated = false 
    end 

    true 
    end 

end 

我布爾:

軌生成腳手架頁面名稱:字符串remote_url:字符串頁:文本摘要:文本page_updated相當肯定這是一個無關緊要的事情,但我似乎遇到 LoadError當試圖 require 'mechanize'rails console和我的測試應用程序。不知道是什麼原因造成的,但是當我能夠成功地測試這個解決方案時,我會更新我的答案。

請務必記得將它添加到您的應用程序的Gemfile

gem 'mechanize', '2.0.1' 
gem 'retryable-rb', '1.1.0' 

用法示例

p = Page.new(:remote_url => 'http://rubyonrails.org/') 
p.save! 
p.page_updated? # => false, since page hasn't been updated since creation 
p.remote_url = 'http://www.google.com/' # for the sake of example 
p.update_page! 
p.page_updated? # => true 
相關問題