2013-03-08 28 views
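Parsing XML with LibXML Ruby truncates data

My goal: I want to grab every element named "SECTION" in the subject XML document, that is, each SECTION together with everything beneath it.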

Constraint: I must use LibXML Ruby, i.e. require 'xml'.

Problem: the output data is truncated.

Questions (see the output for file1.xml):

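  • Why is the output for file1.xml truncated? Note: most of the text between the first <P>(a)...</P> tags is present; the truncation starts at the word "ethic...".
  • Why does the code drop the last two P elements (<P>(b)... and <P>(2)...) and the CITA element?
  • What causes <?xml version="1.0" encoding="UTF-8"?> and <SECTION/> to appear at the end of the output?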

Note: the output for file2.xml is truncated even more severely. I am including it in case it clarifies anything.

Here is the code:

#!/usr/bin/ruby
require "xml"
reader = XML::Reader.file('infile2.xml')
while reader.read
    node = reader.node
    if node.name == "SECTION"
        iteration = XML::Document.string(node.to_s)
        puts iteration
        puts "\n"
    end
end

Input file1.xml:

<?xml version="1.0"?> 
<SECTION> 
    <SECTNO>§ 0.735-1</SECTNO> 
    <SUBJECT>Agency ethics officials.</SUBJECT> 
    <P>(a) <E T="03">Designated Agency Ethics Official (DAEO).</E> The Assistant General Counsel (023) is the designated agency ethics official (DAEO) for the Department of Veterans Affairs. The Deputy Assistant General Counsel (023C) is the alternate DAEO, who is designated to act in the DAEO's absence. The DAEO has primary responsibility for the administration, coordination, and management of the VA ethics program, pursuant to 5 CFR 2638.201-204.</P> 
    <P>(b) <E T="03">Deputy ethics officials.</E> (1) The Regional Counsel are deputy ethics officials. They have been delegated the authority to act for the DAEO within their jurisdiction, under the DAEO's supervision, pursuant to 5 CFR 2638.204.</P> 
    <P>(2) The alternate DAEO, the DAEO's staff, and staff in the Offices of Regional Counsel, may also act as deputy ethics officials pursuant to delegations of one or more of the DAEO's duties from the DAEO or the Regional Counsel.</P> 
    <CITA>[58 FR 61813, Nov. 23, 1993. Redesignated at 61 FR 11309, Mar. 20, 1996]</CITA> 
</SECTION> 

Output, given the input file1.xml (above):

<?xml version="1.0" encoding="UTF-8"?> 
<SECTION> 
    <SECTNO>§ 0.735-1</SECTNO> 
    <SUBJECT>Agency ethics officials.</SUBJECT> 
    <P>(a) <E T="03">Designated Agency Ethics Official (DAEO).</E> The Assistant General Counsel (023) is the designated agency ethics official (DAEO) for the Department of Veterans Affairs. The Deputy Assistant General Counsel (023C) is the alternate DAEO, who is designated to act in the DAEO's absence. The DAEO has primary responsibility for the administration, coordination, and management of the VA ethic</P></SECTION> 

<?xml version="1.0" encoding="UTF-8"?> 
<SECTION/> 

Input file2.xml:

<?xml version="1.0"?> 
<SUBPART> 
    <HD SOURCE="HED">Subpart A—General Provisions</HD> 
    <SECTION> 
    <SECTNO>§ 0.735-1</SECTNO> 
    <SUBJECT>Agency ethics officials.</SUBJECT> 
    <P>(a) <E T="03">Designated Agency Ethics Official (DAEO).</E> The Assistant General Counsel (023) is the designated agency ethics official (DAEO) for the Department of Veterans Affairs. The Deputy Assistant General Counsel (023C) is the alternate DAEO, who is designated to act in the DAEO's absence. The DAEO has primary responsibility for the administration, coordination, and management of the VA ethics program, pursuant to 5 CFR 2638.201-204.</P> 
    <P>(b) <E T="03">Deputy ethics officials.</E> (1) The Regional Counsel are deputy ethics officials. They have been delegated the authority to act for the DAEO within their jurisdiction, under the DAEO's supervision, pursuant to 5 CFR 2638.204.</P> 
    <P>(2) The alternate DAEO, the DAEO's staff, and staff in the Offices of Regional Counsel, may also act as deputy ethics officials pursuant to delegations of one or more of the DAEO's duties from the DAEO or the Regional Counsel.</P> 
    <CITA>[58 FR 61813, Nov. 23, 1993. Redesignated at 61 FR 11309, Mar. 20, 1996]</CITA> 
    </SECTION> 
    <SECTION> 
    <SECTNO>§ 0.735-2</SECTNO> 
    <SUBJECT>Government-wide standards.</SUBJECT> 
    <P>For government-wide standards of ethical conduct and related responsibilities for Federal employees, see 5 CFR Part 735 and Chapter XVI.</P> 
    <CITA>[61 FR 11309, Mar. 20, 1996. Redesignated at 63 FR 33579, June 19, 1998]</CITA> 
    </SECTION> 
</SUBPART> 

Output, given the input file2.xml (above):

<?xml version="1.0" encoding="UTF-8"?> 
<SECTION> 
    <SECTNO>§ 0.735-1</SECTNO> 
    <SUBJECT>Agency ethics officials.</SUBJECT> 
    <P>(a) <E T="03">Designated Agency Ethics Official (DAEO).</E></P></SECTION> 

<?xml version="1.0" encoding="UTF-8"?> 
<SECTION/> 

<?xml version="1.0" encoding="UTF-8"?> 
<SECTION> 
    <SECTNO>§ 0.735-2</SECTNO> 
    <SUBJECT>Government-wide standards.</SUBJECT> 
    <P>For government-wide standards of ethical conduct and related responsibilities for Federal employees, see 5 CFR Part 735 and Chapter XVI.</P> 
    <CITA/></SECTION> 

<?xml version="1.0" encoding="UTF-8"?> 
<SECTION/> 

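I'm not sure what the specific problem is, but I suspect it has to do with the content being more than just text, e.g. it contains nested nodes. Your best bet is to treat the document as XML rather than as text. – 2013-03-08 19:36:24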

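@DaveNewton, I _thought_ I was treating the document as XML, but I am (apparently) confused. Where did my thinking go wrong? I'm trying to get the SECTION node and everything in it, including nested nodes and their content. Thanks for any ideas. – HisHighnessDog 2013-03-08 19:44:36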

Answer

Unless you have a huge XML document, consider something like this instead:

require "xml" 
doc = XML::Document.file('infile1.xml') 
doc.find('/SECTION').each do |s| 
    puts "[#{s}]" 
end 

This outputs:

<SECTION> 
    <SECTNO>§ 0.735-1</SECTNO> 
    <SUBJECT>Agency ethics officials.</SUBJECT> 
    <P>(a) <E T="03">Designated Agency Ethics Official (DAEO).</E> The Assistant General Counsel (023) is the designated agency ethics official (DAEO) for the Department of Veterans Affairs. The Deputy Assistant General Counsel (023C) is the alternate DAEO, who is designated to act in the DAEO's absence. The DAEO has primary responsibility for the administration, coordination, and management of the VA ethics program, pursuant to 5 CFR 2638.201-204.</P> 
    <P>(b) <E T="03">Deputy ethics officials.</E> (1) The Regional Counsel are deputy ethics officials. They have been delegated the authority to act for the DAEO within their jurisdiction, under the DAEO's supervision, pursuant to 5 CFR 2638.204.</P> 
    <P>(2) The alternate DAEO, the DAEO's staff, and staff in the Offices of Regional Counsel, may also act as deputy ethics officials pursuant to delegations of one or more of the DAEO's duties from the DAEO or the Regional Counsel.</P> 
    <CITA>[58 FR 61813, Nov. 23, 1993. Redesignated at 61 FR 11309, Mar. 20, 1996]</CITA> 
</SECTION> 

This doesn't answer the question; it is a workaround.

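I'm not sure what the actual problem with using the reader is, but I suspect it is something cursor-ish. For example, the following works, apart from the extra empty SECTION that still appears for the first XML document: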

if node.name == "SECTION" 
    puts "#{reader.read_outer_xml}" 
end 

If I figure out the "why" behind this behavior, I'll report back. Thanks for the workaround. – HisHighnessDog 2013-03-09 00:09:01