How can I use Nokogiri to get all the text except what is inside specific tags?

I have the following XML:

<w:body> 
    <w:p w14:paraId="15812FB6" w14:textId="27A946A1" w:rsidR="001665B3" w:rsidRDefault="00771852"> 
    <w:r> 
     <w:t xml:space="preserve">I am writing this </w:t> 
    </w:r> 
    <w:ins w:author="Mitchell Gould" w:date="2016-10-04T17:24:00Z" w:id="0"> 
     <w:r w:rsidR="00A1573E"> 
     <w:t>text to look</w:t> 
     </w:r> 
    </w:ins> 
    <w:del w:author="Mitchell Gould" w:date="2016-10-04T17:24:00Z" w:id="1"> 
     <w:r w:rsidDel="00A1573E"> 
     <w:delText>to test</w:delText> 
     </w:r> 
    </w:del> 
...

I know I can get all of the text using:

only_text_array = @file.search('//text()')

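However, what I actually want is two sets of text:

One containing all of the text except the text inside <w:del>...</w:del> elements.
Another containing all of the text except the text inside <w:ins>...</w:ins> elements.

How can I do this?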

2016-10-04 chell

You can try the following XPath:

//text()[not(ancestor::w:del or ancestor::w:ins)]

This XPath returns every text node that has no w:del or w:ins ancestor.

2016-10-04 13:38:47 har07

Exactly what I was looking for. Thanks, har07. – chell

I'd do something like this:

require 'nokogiri' 

doc = Nokogiri::HTML(<<EOT) 
<html> 
    <body> 
    <p class="ignore">foobar</p> 
    <p>Keep this</p> 
    <p class="ignore2">foobar2</p> 
    </body> 
</html> 
EOT

text1, text2 = %w[.ignore .ignore2].map do |s| 
    tmp_doc = doc.dup 
    tmp_doc.search(s).remove 
    tmp_doc.text.strip 
end 

text1 # => "Keep this\n foobar2" 
text2 # => "foobar\n Keep this"

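It iterates over a list of selectors for the unwanted content, dups the document, removes the unwanted nodes from the copy, and returns the copy's text after a little cleanup.

dup makes a deep copy by default, so removing nodes from the copy does not affect doc.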

2016-10-04 21:30:10
