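Parsing XML with LibXML Ruby truncates data
My goal: I want to grab every element named "SECTION" in the XML document in question; that is, get each SECTION together with everything beneath it.
Constraint: I have to use LibXML Ruby, i.e. require 'xml'.
Problem: the output data is truncated.
Questions (see the output for file1.xml):
- Why is the output for file1.xml truncated? Note: most of the text between the first <P>(a)...</P> tags is lost (the truncation starts at the word "ethic...").
- Why does the code drop the last two P elements (<P>(b)... and <P>(2)...) and the CITA element? What causes that? And why do <?xml version="1.0" encoding="UTF-8"?> and <SECTION/> appear at the end of the output?
Note: the output for file2.xml is truncated even more severely. I am including it in case it clarifies anything.
Here is the code: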
#!/usr/bin/ruby
require "xml"
reader = XML::Reader.file('infile2.xml')
while reader.read
  node = reader.node
  if node.name == "SECTION"
    iteration = XML::Document.string(node.to_s)
    puts iteration
    puts "\n"
  end
end
Input file1.xml:
<?xml version="1.0"?>
<SECTION>
<SECTNO>§ 0.735-1</SECTNO>
<SUBJECT>Agency ethics officials.</SUBJECT>
<P>(a) <E T="03">Designated Agency Ethics Official (DAEO).</E> The Assistant General Counsel (023) is the designated agency ethics official (DAEO) for the Department of Veterans Affairs. The Deputy Assistant General Counsel (023C) is the alternate DAEO, who is designated to act in the DAEO's absence. The DAEO has primary responsibility for the administration, coordination, and management of the VA ethics program, pursuant to 5 CFR 2638.201-204.</P>
<P>(b) <E T="03">Deputy ethics officials.</E> (1) The Regional Counsel are deputy ethics officials. They have been delegated the authority to act for the DAEO within their jurisdiction, under the DAEO's supervision, pursuant to 5 CFR 2638.204.</P>
<P>(2) The alternate DAEO, the DAEO's staff, and staff in the Offices of Regional Counsel, may also act as deputy ethics officials pursuant to delegations of one or more of the DAEO's duties from the DAEO or the Regional Counsel.</P>
<CITA>[58 FR 61813, Nov. 23, 1993. Redesignated at 61 FR 11309, Mar. 20, 1996]</CITA>
</SECTION>
Output, given input file1.xml (above):
<?xml version="1.0" encoding="UTF-8"?>
<SECTION>
<SECTNO>§ 0.735-1</SECTNO>
<SUBJECT>Agency ethics officials.</SUBJECT>
<P>(a) <E T="03">Designated Agency Ethics Official (DAEO).</E> The Assistant General Counsel (023) is the designated agency ethics official (DAEO) for the Department of Veterans Affairs. The Deputy Assistant General Counsel (023C) is the alternate DAEO, who is designated to act in the DAEO's absence. The DAEO has primary responsibility for the administration, coordination, and management of the VA ethic</P></SECTION>
<?xml version="1.0" encoding="UTF-8"?>
<SECTION/>
Input file2.xml:
<?xml version="1.0"?>
<SUBPART>
<HD SOURCE="HED">Subpart A—General Provisions</HD>
<SECTION>
<SECTNO>§ 0.735-1</SECTNO>
<SUBJECT>Agency ethics officials.</SUBJECT>
<P>(a) <E T="03">Designated Agency Ethics Official (DAEO).</E> The Assistant General Counsel (023) is the designated agency ethics official (DAEO) for the Department of Veterans Affairs. The Deputy Assistant General Counsel (023C) is the alternate DAEO, who is designated to act in the DAEO's absence. The DAEO has primary responsibility for the administration, coordination, and management of the VA ethics program, pursuant to 5 CFR 2638.201-204.</P>
<P>(b) <E T="03">Deputy ethics officials.</E> (1) The Regional Counsel are deputy ethics officials. They have been delegated the authority to act for the DAEO within their jurisdiction, under the DAEO's supervision, pursuant to 5 CFR 2638.204.</P>
<P>(2) The alternate DAEO, the DAEO's staff, and staff in the Offices of Regional Counsel, may also act as deputy ethics officials pursuant to delegations of one or more of the DAEO's duties from the DAEO or the Regional Counsel.</P>
<CITA>[58 FR 61813, Nov. 23, 1993. Redesignated at 61 FR 11309, Mar. 20, 1996]</CITA>
</SECTION>
<SECTION>
<SECTNO>§ 0.735-2</SECTNO>
<SUBJECT>Government-wide standards.</SUBJECT>
<P>For government-wide standards of ethical conduct and related responsibilities for Federal employees, see 5 CFR Part 735 and Chapter XVI.</P>
<CITA>[61 FR 11309, Mar. 20, 1996. Redesignated at 63 FR 33579, June 19, 1998]</CITA>
</SECTION>
</SUBPART>
Output, given input file2.xml (above):
<?xml version="1.0" encoding="UTF-8"?>
<SECTION>
<SECTNO>§ 0.735-1</SECTNO>
<SUBJECT>Agency ethics officials.</SUBJECT>
<P>(a) <E T="03">Designated Agency Ethics Official (DAEO).</E></P></SECTION>
<?xml version="1.0" encoding="UTF-8"?>
<SECTION/>
<?xml version="1.0" encoding="UTF-8"?>
<SECTION>
<SECTNO>§ 0.735-2</SECTNO>
<SUBJECT>Government-wide standards.</SUBJECT>
<P>For government-wide standards of ethical conduct and related responsibilities for Federal employees, see 5 CFR Part 735 and Chapter XVI.</P>
<CITA/></SECTION>
<?xml version="1.0" encoding="UTF-8"?>
<SECTION/>
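I'm not sure what the specific problem is, but I suspect it has to do with the content not being just text, e.g. it contains nested nodes. The best approach is to treat the document as XML rather than as text. – 2013-03-08 19:36:24
@DaveNewton, I _thought_ I was treating the document as XML, but I am (evidently) confused. Where did my thinking go wrong? I'm trying to get the SECTION nodes with all of their contents, including nested nodes and their contents. Thanks for any ideas. – HisHighnessDog 2013-03-08 19:44:36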