How to fetch all pages of a specific website with Ruby on Rails

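I'm currently building a website with Ruby on Rails (Ruby 2.2.1, Rails 4.2.1) and want to extract data from a specific website and then display it. I use Nokogiri to fetch the page content. What I'm looking for is a way to fetch all the pages of that site and get their content.

Here is my code: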

    require 'nokogiri'
    require 'open-uri'

    # open-uri needs a full URL including the scheme
    doc = Nokogiri::HTML(open("http://www.google.com").read)
    puts doc.at_css('title').text
    puts doc.to_html

Source

2015-07-28 Maria Minh

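The code you need is fairly complex; what you have written is about 1% of it. You basically need to walk every link on each page as you fetch it, filter out external links, and keep an array of the pages already fetched so you don't request them twice.

You should search Stack Overflow. There are many questions along this line. Here are some pointers: http://stackoverflow.com/a/4981595/128421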
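
The bookkeeping described in the comments above (follow each link, drop external ones, remember what has been fetched) can be sketched without any network access. This is only an illustration under stated assumptions: `crawl`, `SITE`, and the paths in it are hypothetical names, and the block passed to `crawl` stands in for the real "fetch the page and extract its links" step.

```ruby
require 'set'

# Breadth-first crawl. `start` is the first path to visit; the block
# receives a path and must return the list of links found on that page.
# Fetching and parsing are left to the caller, so this only shows the
# bookkeeping: a queue of pages to visit and a set of pages already seen.
def crawl(start)
  seen  = Set.new([start])
  queue = [start]
  order = []

  until queue.empty?
    path = queue.shift
    order << path
    yield(path).each do |link|
      next unless link.start_with?('/') # keep internal, root-relative links only
      queue << link if seen.add?(link)  # add? returns nil when already present
    end
  end

  order
end

# Hypothetical in-memory "site" standing in for real HTTP requests.
SITE = {
  '/'       => ['/about', '/blog', 'http://example.org/'],
  '/about'  => ['/'],
  '/blog'   => ['/about', '/blog/1'],
  '/blog/1' => ['/blog']
}

visited = crawl('/') { |path| SITE.fetch(path, []) }
puts visited.inspect
```

Using a `Set` for the visited pages keeps the duplicate check cheap even when the site has many pages, and the explicit queue avoids deep recursion on long link chains.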

Here is a very rough gist of what you need:

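require 'nokogiri'
require 'open-uri'

class Parser
  attr_reader :pages

  def initialize
    @pages = [] # pages fetched so far; also used to avoid refetching
  end

  # Crawls every internal page reachable from host, e.g. "http://example.com".
  def fetch_all(host)
    @host = host.chomp('/')
    fetch(@host + '/')
    pages
  end

  private

  def fetch(url)
    return if pages.any? { |page| page.url == url }
    # open-uri makes Kernel#open accept URLs on Ruby 2.x; newer Rubies need URI.open.
    parse_page(url, Nokogiri::HTML(open(url).read))
  end

  def parse_page(url, document)
    links = extract_links(document)
    title = document.at_css('title')

    pages << Page.new(
      url: url,
      title: title ? title.text : nil,
      html_content: document.to_html,
      links: links
    )

    links.each { |link| fetch(@host + link) }
  end

  # Returns the unique root-relative paths linked from the document.
  def extract_links(document)
    document.css('a').map do |link|
      href = link['href'].to_s.sub(@host, '') # to_s guards against <a> without href
      href if href.start_with?('/') && !href.start_with?('//')
    end.compact.uniq
  end
end

class Page
  attr_accessor :url, :title, :html_content, :links

  def initialize(url:, title:, html_content:, links:)
    @url = url
    @title = title
    @html_content = html_content
    @links = links
  end
end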
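
The gist above only follows hrefs that start with "/", so relative links such as `post-1` or `../contact` are skipped. One possible way to handle them, sketched here with the standard `uri` library only, is to resolve every href against the URL of the page it was found on and keep the result when the host matches. `internal_url` is a hypothetical helper, not part of the answer or of Nokogiri, and `example.com` is a placeholder host.

```ruby
require 'uri'

# Resolve `href` against the page it was found on and return the absolute
# URL as a String, or nil when the link is unusable or points elsewhere.
def internal_url(page_url, href)
  return nil if href.nil? || href.strip.empty?

  target = URI.join(page_url, href.strip)
  # Skip non-HTTP schemes such as mailto: (URI::HTTPS is a subclass of URI::HTTP).
  return nil unless target.is_a?(URI::HTTP)
  # Skip links that lead to another host.
  return nil unless target.host == URI(page_url).host

  target.fragment = nil # "#section" still refers to the same page
  target.to_s
rescue URI::InvalidURIError
  nil
end

puts internal_url('http://example.com/blog/', 'post-1')
puts internal_url('http://example.com/blog/post-1', '/about#team')
puts internal_url('http://example.com/', 'http://other.org/x').inspect
```

In a crawler like the one above, the returned strings could serve directly as the keys for the "already fetched" check, since dropping the fragment means `/about` and `/about#team` count as the same page.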

Source

2015-07-28 10:50:04
