2013-12-17 35 views
0

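How to get an RSS feed in XML format for a Ruby script

I am using the following Ruby script, taken from this Dashing widget. It fetches an RSS feed, parses it, and sends the parsed titles and descriptions to the widget.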

require 'net/http'
require 'uri'
require 'nokogiri'
require 'htmlentities'

news_feeds = {
  "seattle-times" => "http://seattletimes.com/rss/home.xml",
}

Decoder = HTMLEntities.new

class News
  def initialize(widget_id, feed)
    @widget_id = widget_id
    # pick apart feed into domain and path
    uri = URI.parse(feed)
    @path = uri.path
    @http = Net::HTTP.new(uri.host)
  end

  def widget_id()
    @widget_id
  end

  def latest_headlines()
    response = @http.request(Net::HTTP::Get.new(@path))
    doc = Nokogiri::XML(response.body)
    news_headlines = []
    doc.xpath('//channel/item').each do |news_item|
      title = clean_html(news_item.xpath('title').text)
      summary = clean_html(news_item.xpath('description').text)
      news_headlines.push({ title: title, description: summary })
    end
    news_headlines
  end

  def clean_html(html)
    html = html.gsub(/<\/?[^>]*>/, "")
    html = Decoder.decode(html)
    return html
  end

end

@News = []
news_feeds.each do |widget_id, feed|
  begin
    @News.push(News.new(widget_id, feed))
  rescue Exception => e
    puts e.to_s
  end
end

SCHEDULER.every '60m', :first_in => 0 do |job|
  @News.each do |news|
    headlines = news.latest_headlines()
    send_event(news.widget_id, { :headlines => headlines })
  end
end

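The example RSS feed works fine because its URL points to an XML file. However, I want to use a different RSS feed, one that does not serve an actual XML file. The feed I want is at http://www.ttc.ca/RSS/Service_Alerts/index.rss, and it does not seem to display anything in the widget. Instead of "http://www.ttc.ca/RSS/Service_Alerts/index.rss" I also tried "http://www.ttc.ca/RSS/Service_Alerts/index.rss?format=xml" and "view-source:http://www.ttc.ca/RSS/Service_Alerts/index.rss", but with no luck. Does anyone know how I can get the actual XML data behind this RSS feed so that I can use it with this Ruby script?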

You should accept diego.greyrobot's answer, since it is correct, so that he gets the reputation points.

Answers

2

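You're right: that link does not serve regular RSS XML, so the script fails on it, because it was written specifically to parse the example XML. The feed you are trying to parse serves RDF XML, which you can parse with the RDFXML RubyGem (the gem is named rdf-rdfxml and is loaded with require 'rdf/rdfxml').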
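To check which flavor a feed is before choosing a parser, one option is to look at its root element. The following is only a minimal sketch: it assumes the feed is either RSS 1.0 (root element rdf:RDF) or RSS 2.0 (root element rss), the helper name feed_flavor and both sample documents are made up for illustration, and REXML is used simply because it ships with Ruby.

```ruby
require 'rexml/document'

# Namespace URI of the RDF syntax vocabulary (root element of RSS 1.0 feeds).
RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

# Hypothetical helper: classify a feed by its root element.
def feed_flavor(xml)
  root = REXML::Document.new(xml).root
  return :unknown if root.nil?

  if root.name == 'RDF' && root.namespace == RDF_NS
    :rss1_rdf   # RDF-based feed, e.g. RSS 1.0
  elsif root.name == 'rss'
    :rss2       # the shape the original script expects
  else
    :unknown
  end
end

# Made-up sample documents, not real feeds.
RSS2_SAMPLE = <<~XML
  <rss version="2.0"><channel><title>Example</title></channel></rss>
XML

RSS1_SAMPLE = <<~XML
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns="http://purl.org/rss/1.0/">
    <channel rdf:about="http://example.org/feed"><title>Example</title></channel>
  </rdf:RDF>
XML

puts feed_flavor(RSS2_SAMPLE)
puts feed_flavor(RSS1_SAMPLE)
```

In the Dashing job, the same check could be run on response.body before deciding how to parse it.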

Something like:

require 'nokogiri' 
require 'rdf/rdfxml' 

rss_feed = 'http://www.ttc.ca/RSS/Service_Alerts/index.rss' 

RDF::RDFXML::Reader.open(rss_feed) do |reader| 
    # use reader to iterate over elements within the document 
end 

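From here, you can work out how to use RDFXML to extract the content you want. I would start by inspecting the reader object, inside the block, for methods I could use: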

puts reader.methods.sort - Object.instance_methods

That prints the reader's own methods. Look for one you may be able to use for your purpose, such as reader.each_entry.

To dig further, you can inspect what each entry looks like:

reader.each_entry do |entry| 
    puts "----here's an entry----" 
    puts entry.inspect 
end 

Or see which methods you can call on an entry:

reader.each_entry do |entry|
  puts "----here's an entry's methods----"
  # subtract the methods every object has, leaving the entry's own
  puts entry.methods.sort - Object.instance_methods
  break
end

I was able to crudely pull out some titles and descriptions using this hack:

RDF::RDFXML::Reader.open('http://www.ttc.ca/RSS/Service_Alerts/index.rss') do |reader|
  reader.each_object do |object|
    puts object.to_s if object.is_a? RDF::Literal
  end
end

# prints:

# TTC Service Alerts 
# http://www.ttc.ca/Service_Advisories/index.jsp 

#  TTC Service Alerts. 

# TTC.ca 
# http://www.ttc.ca 
# http://www.ttc.ca/images/ttc-main-logo.gif 
# Service Advisory 
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory 

# 196 York University Rocket route diverting northbound via Sentinel, Finch due to a collision that has closed the York U Bus way. 
# - Affecting: Bus Routes: 196 York University Rocket 
# 2013-12-17T13:49:03.800-05:00 
# Service Advisory (2) 
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory+(2) 

# 107B Keele North route diverting northbound via Keele, Lepage due to a collision that has closed the York U Bus way. 
# - Affecting: Bus Routes: 107 Keele North 
# 2013-12-17T13:51:08.347-05:00 

But I couldn't quickly find a way to tell which one is the title and which is the description :/
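One way to keep titles and descriptions apart is to treat the feed as namespaced XML and select the elements by name. This is only a sketch, and it rests on assumptions: that the feed follows RSS 1.0, where item elements sit directly under rdf:RDF (next to channel, not inside it) in the http://purl.org/rss/1.0/ namespace, which would also explain why '//channel/item' matches nothing. The sample document, the prefixes, and the helper names text_at and headlines_from are made up for illustration, and REXML stands in for Nokogiri so that the sketch runs without extra gems.

```ruby
require 'rexml/document'

# Namespace URIs used by RSS 1.0 feeds; the prefixes are local to this script.
NS = {
  'rss' => 'http://purl.org/rss/1.0/',
  'dc'  => 'http://purl.org/dc/elements/1.1/'
}.freeze

# Made-up sample in the RSS 1.0 shape; this is not the real TTC feed.
SAMPLE = <<~XML
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns="http://purl.org/rss/1.0/">
    <channel rdf:about="http://example.org/feed">
      <title>Example Service Alerts</title>
    </channel>
    <item rdf:about="http://example.org/alerts#1">
      <title>Service Advisory</title>
      <description>196 route diverting northbound.</description>
      <dc:date>2013-12-17T13:49:03-05:00</dc:date>
    </item>
    <item rdf:about="http://example.org/alerts#2">
      <title>Service Advisory (2)</title>
      <description>107B route diverting northbound.</description>
      <dc:date>2013-12-17T13:51:08-05:00</dc:date>
    </item>
  </rdf:RDF>
XML

# Text of the first node matching a namespaced path, or '' when absent.
def text_at(node, path)
  found = REXML::XPath.first(node, path, NS)
  found ? found.text.to_s.strip : ''
end

# Build the same { title:, description: } hashes the widget expects.
def headlines_from(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//rss:item', NS).map do |item|
    {
      title: text_at(item, 'rss:title'),
      description: text_at(item, 'rss:description'),
      date: text_at(item, 'dc:date')
    }
  end
end

headlines_from(SAMPLE).each { |h| puts "#{h[:title]}: #{h[:description]}" }
```

In the original script, Nokogiri accepts a namespace hash in the same way, so a call such as doc.xpath('//rss:item', NS) inside latest_headlines should select the items without adding the RDF gem, provided the real feed matches the shape assumed here.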

Finally, if you still can't find what you want, please open a new question with this information.

Good luck!

Thank you, this is very helpful – user1893354