How to get the HTML body from a URL using Nokogiri in Rails?

I want to parse the body of a URL. How can I get the HTML body from a URL using Nokogiri in Rails?

For example:

url = 'http://rca.yandex.com/?key=rca.1.1.20140120T051507Z.3db118ab435efdff.6c84331313b6b7d66abd191410f72e0e1c3c8795&url=http://endtimeheadlines.wordpress.com/2014/01/17/think-tank-extraordinary-crisis-needed-to-preserve-new-world-order/#comment-36708?utm_source=twitterfeed&utm_medium=facebook[&callback=http://64.191.99.245:3023/posts][&full=1]'

When I try:

page = Nokogiri::HTML(html)

I get:

#<Nokogiri::HTML::Document:0x52fd6d6 name="document" children=[#<Nokogiri::XML::DTD:0x52fd1f4 name="html">, #<Nokogiri::XML::Element:0x52fc6aa name="html" children=[#<Nokogiri::XML::Element:0x5301f56 name="body" children=[#<Nokogiri::XML::Element:0x53018d0 name="p" children=[#<Nokogiri::XML::Text:0x53015f6 "http://rca.yandex.com/?key=rca.1.1.20140120T051507Z.3db118ab435efdff.6c84331313b6b7d66abd191410f72e0e1c3c8795&url=http://endtimeheadlines.wordpress.com/2014/01/17/think-tank-extraordinary-crisis-needed-to-preserve-new-world-order/#comment-36708?utm_source=twitterfeed&utm_medium=facebook[&callback=http://64.191.99.245:3023/posts][&full=1]">]>]>]>]>

How do I get the attributes inside this URL?

For example: page.css("div"). I want to get values from the HTML body.


2014-01-20 tardjo

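I'd recommend reading the ["Searching" tutorial](http://nokogiri.org/tutorials/searching_a_xml_html_document.html) provided for Nokogiri. It should be enough to explain what you are trying to do.

Also, define "attributes"? Attributes of the body tag are usually things like 'on_load'. Do you mean the child nodes or the inner HTML?

Did my answer help?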

You can use to_xml, to_html, or another format as needed. See the Nokogiri documentation for the other formatting options.

page = Nokogiri::HTML(html) 
page.to_xml

And to get the content of the divs in your document, use:

divs = page.css('div') # always returns a Nokogiri::XML::NodeSet (possibly empty), however many divs the document has
divs.to_xml


2014-01-20 06:26:56 vee

-1

Of course, you already have the parsed HTML/XML tree from the link. And here it is in your example:

page.css("div")

You simply get all the divs of the parsed document as an Array-like NodeSet. You can then enumerate them one by one and get the full text of each div:

page.css("div").each do |div|
  p div.text
end


2014-01-20 08:57:50

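While writing fake Shakespearean English might seem fun, remember that the answers you write are read by many people who don't speak English as their primary language. Stack Overflow strives to be like Wikipedia for programming questions, so please stick to easy-to-read, standard English as best you can.

Would the downvoter care to explain the downvote? =)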

page.css('body') should work. If it doesn't, try using to_s.


2014-01-20 09:59:09 skozz

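Don't use 'to_s', because the result will be a string representation of the body tag's contents, which won't be very useful. Also, 'css' doesn't return what you think it does.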

It's not entirely clear what you're trying to do, but this might help:

require 'nokogiri' 

html = '<html><head><title>foo</title><body><p>bar</p></body></html>' 

doc = Nokogiri::HTML(html)

Using at finds the first occurrence of a tag, which is sensible in an HTML document because you should only have one <body> tag.

doc.at('body') # => #<Nokogiri::XML::Element:0x3ff194d24cd4 name="body" children=[#<Nokogiri::XML::Element:0x3ff194d24acc name="p" children=[#<Nokogiri::XML::Text:0x3ff194d248c4 "bar">]>]>

If you want the children of the tag, use children to retrieve them:

doc.at('body').children # => [#<Nokogiri::XML::Element:0x3ff194d24acc name="p" children=[#<Nokogiri::XML::Text:0x3ff194d248c4 "bar">]>]

If you want to get the child nodes as HTML:

doc.at('body').children.to_html # => "<p>bar</p>" 
doc.at('body').inner_html # => "<p>bar</p>"

If you want the text content of the body tag:

doc.at('body').content # => "bar" 
doc.at('body').text # => "bar"

If by "attributes" you really mean the attributes of the <body> tag itself:

require 'nokogiri' 

html = '<html><head><title>foo</title><body on_load="do_something()"><p>bar</p></body></html>' 

doc = Nokogiri::HTML(html) 
doc.at('body').attributes # => {"on_load"=>#<Nokogiri::XML::Attr:0x3fdc3d923ca0 name="on_load" value="do_something()">} 
doc.at('body')['on_load'] # => "do_something()"

attributes returns a hash, so you can directly access whatever you want. As a shortcut, Nokogiri::XML::Node also understands [], giving us typical hash-style access to the value.


2014-01-20 16:21:23
