Nokogiri: strip style attributes

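I scraped an HTML page with Nokogiri and I want to strip out all style attributes.
How can I achieve this? (I'm not using Rails, so I can't use its sanitize method, and I don't want to use the sanitize gem because I want to blacklist what gets removed rather than whitelist what stays.)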

require 'open-uri'
require 'nokogiri'

html = URI.open(url) # url: the page being scraped (not shown here)
doc = Nokogiri::HTML(html.read)
doc.css('.post').each do |post|
  puts post.to_s
end

=> <p><span style="font-size: x-large">bla bla <a href="http://torrentfreak.com/netflix-is-killing-bittorrent-in-the-us-110427/">statistica</a> blabla</span></p>

I want it to be:

=> <p><span>bla bla <a href="http://torrentfreak.com/netflix-is-killing-bittorrent-in-the-us-110427/">statistica</a> blabla</span></p>

Source

2011-05-23 keepitterron

require 'nokogiri' 

html = '<p class="post"><span style="font-size: x-large">bla bla</span></p>' 
doc = Nokogiri::HTML(html) 
doc.xpath('//@style').remove 
puts doc.css('.post') 
#=> <p class="post"><span>bla bla</span></p>

Edited to show that you can call NodeSet#remove directly instead of having to use .each(&:remove).

Note that if you have a DocumentFragment rather than a Document, Nokogiri has a longstanding bug where searching from a fragment does not work as you would expect. The workaround is to use:

doc.xpath('@style|.//@style').remove

Source

2011-05-23 22:26:25 Phrogz

Wow. That was easy! I love it. Thanks! – keepitterron 2011-05-25 08:14:16

Use 'doc.xpath('.//@style').remove' to remove all inline styles from all nodes; note the '.' mentioned by @bricker below. Chain '.to_s' to get the resulting HTML string. – 2014-03-16 01:08:49

Correction: don't chain it; instead call 'description.to_s' to get the resulting HTML string. If you don't want the 'DOCTYPE', you should use the 'Nokogiri::HTML.fragment' method, see http://stackoverflow.com/questions/4723344/how-to-prevent-nokogiri-from-adding-doctype-tags – 2014-03-16 01:17:15

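I tried the answer from Phrogz but couldn't get it to work (I was using a document fragment, though I would have thought it should work the same?).

The "//" at the start doesn't seem to check all nodes as I would expect. In the end I did something a bit more long-winded, but it works, so for the record, in case anyone else has the same trouble, here is my solution (dirty though it is):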

doc = Nokogiri::HTML::Document.new 
body_dom = doc.fragment(my_html) 

# strip out any attributes we don't want 
body_dom.xpath('.//*[@align]|*[@align]').each do |tag| 
    tag.attributes["align"].remove 
end

Cheers

Pete

Source

2012-07-11 10:03:26

This might also work: 'body_dom.xpath('.//@class')' (note the extra dot at the start of the xpath) – bricker 2013-01-29 21:24:25

Nokogiri and/or LibXML2 has [a bug with XPath inside fragments](https://github.com/sparklemotion/nokogiri/issues/572). The best current workaround is as you noticed: instead of 'foo' you must use 'foo|.//foo'. – Phrogz 2014-03-16 02:22:17

This works with both documents and document fragments:

doc = Nokogiri::HTML::DocumentFragment.parse(...)

or

doc = Nokogiri::HTML(...)

To remove all 'style' attributes, you can do:

doc.css('*').remove_attr('style')

Source

2014-10-08 01:50:24 PlagueHammer
