
Below is the code for a scraper I wrote. I need help adding a delay to it: I want to scrape one page every hour. How do I scrape a website once an hour with Nokogiri?

require 'open-uri'
require 'nokogiri'
require 'json'     # needed for JSON.parse below
require 'sanitize'

class Scraper

  def initialize(url_to_scrape)
    @url = url_to_scrape
  end

  def scrape
    # TO DO: change to JSON
    # page = Nokogiri::HTML(open(@url))
    puts "Initiating scrape..."
    raw_response = open(@url)
    json_response = JSON.parse(raw_response.read)
    page = Nokogiri::HTML(json_response["html"])

    # your page should now be a hash. You need the page["html"]

    # Change this to parse the a tags with the class "article_title"
    # and build the links array for each href in these article_title links
    puts "Scraping links..."
    links = page.css(".article_title")
    articles = []

    # everything else here should work fine.
    # Limit the number of links to scrape for testing phase
    puts "Building articles collection..."
    links.each do |link|
      article_url = "http://seekingalpha.com" + link["href"]
      article_page = Nokogiri::HTML(open(article_url))
      article = {}
      article[:company] = article_page.css("#about_primary_stocks").css("a")
      article[:content] = article_page.css("#article_content")
      article[:content] = Sanitize.clean(article[:content].to_s)
      unless article[:content].blank?
        articles << article
      end
    end

    puts "Clearing all existing transcripts..."
    Transcript.destroy_all
    # Iterate over the articles collection and save each record into the database
    puts "Saving new transcripts..."
    articles.each do |article|
      transcript = Transcript.new
      transcript.stock_symbol = article[:company].text.to_s
      transcript.content = article[:content].to_s
      transcript.save
    end

    #return articles
  end

end

Answer


So what are you doing with the articles array once the scrape finishes?

I'm not sure if it's exactly what you're looking for, but I would simply use cron to schedule the script to run once an hour. If your script is part of a larger application, there's a neat gem called whenever that provides a Ruby wrapper around cron jobs.
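As a minimal sketch of the whenever option (not from the original answer): a config/schedule.rb assuming the Scraper class above lives inside a Rails app, so runner can load it, and using a placeholder URL since the question never shows how Scraper is invoked.

# config/schedule.rb (whenever gem)
# Run the scrape at the top of every hour via a Rails runner.
every 1.hour do
  runner 'Scraper.new("http://example.com/page-to-scrape").scrape'
end

Running whenever --update-crontab then writes the equivalent entry into your crontab. With plain cron (no whenever), a line such as 0 * * * * /usr/bin/ruby /path/to/run_scraper.rb in your crontab (path hypothetical) gives the same hourly schedule.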

Hope it helps.