Using Nokogiri and regular expressions

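I am trying to parse an XML document that contains embedded, encoded markup tags, using Nokogiri and Ruby. What I want is the text, such as "Trennmesser", that is not inside the embedded tags.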

In the second example:

<seg>Hilfsmittel <ph>&lt;[email protected]@Lge&gt;</ph>X <ph>&lt;F0&gt;</ph>= 0,5mm zwischen Beschleunigerwalze <ph>&lt;[email protected]@Lge&gt;</ph>D<ph>&lt;F0&gt;</ph> und Trennmesser schieben.</seg>

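The text between a closing /ph tag and an opening ph tag is also of interest, so the regular expression would need to extract the string "Hilfsmittel 0,5mm zwischen Beschleunigerwalze und Trennmesser schieben." and discard everything else.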

I have also uploaded part of the document here:
http://pastebin.com/Q8CdnASz

Source

2011-12-24 Vince

Try this in IRB:

require 'nokogiri' 
x = Nokogiri::XML.parse('<seg>Hilfsmittel <ph>&lt;[email protected]@Lge&gt;</ph>X <ph>&lt;F0&gt;</ph>= 0,5mm zwischen Beschleunigerwalze <ph>&lt;[email protected]@Lge&gt;</ph>D<ph>&lt;F0&gt;</ph> und Trennmesser schieben.</seg>') 
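# Array#join ignores a block, so map each remaining text node to its content before joining.
x.xpath('//seg').children.reject { |node| node.element? }.map(&:content).join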

For me this outputs:

=> "Hilfsmittel X = 0,5mm zwischen Beschleunigerwalze D und Trennmesser schieben."

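The idea here is that we iterate over the children of the <seg> tag and reject those that are themselves elements (<ph>), which should leave only the text nodes. We then take the resulting array and join the content of those nodes into a single string.

Note that the output differs slightly from what you described, because there are also the extra D and X between the tags.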

Source

2011-12-24 10:07:19 Cyberfox

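The content inside the <ph> tags has been encoded to preserve the reserved characters < and >.

A clean way to handle this is to let Nokogiri re-parse those chunks as XML: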

require 'nokogiri' 

doc = Nokogiri::XML('<seg>Trennmesser <ph>&lt;I.FIGREF ITEM=&quot;3&quot; FORMAT=&quot;PARENTHESIS&quot;&gt;</ph><bpt i="1">&lt;I.FIGTARGET TARGET=&quot;CIADDAJA&quot;&gt;</bpt><ept i="1">&lt;/I.FIGREF&gt;</ept></seg>') 

ph = Nokogiri::XML::DocumentFragment.parse(doc.at('seg ph').content) 
puts ph.to_xml

This outputs the following node, showing that Nokogiri rebuilt the fragment correctly:

<I.FIGREF ITEM="3" FORMAT="PARENTHESIS"/>

To extract the text inside the <seg> tag:

doc.at('//seg/text()').text 
=> "Trennmesser "

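When working with HTML or XML, it is never a good idea to assume up front that a regular expression will be the best way to extract something. HTML and XML are both too irregular and "flexible" (where flexible means the markup is often annoyingly malformed, or defined in entirely unique and unexpected ways).

To get the full content inside the <seg> tag for the second question: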

require 'nokogiri' 

doc = Nokogiri::XML('<seg>Hilfsmittel <ph>&lt;[email protected]@Lge&gt;</ph>X <ph>&lt;F0&gt;</ph>= 0,5mm zwischen Beschleunigerwalze <ph>&lt;[email protected]@Lge&gt;</ph>D<ph>&lt;F0&gt;</ph> und Trennmesser schieben.</seg>') 

seg = Nokogiri::XML::DocumentFragment.parse(doc.at('seg').content) 
puts seg.content

This outputs:

Hilfsmittel @[email protected]>X = 0,5mm zwischen Beschleunigerwalze @[email protected]>D und Trennmesser schieben.

Source

2012-09-21 16:53:10
