How do I associate XML text with the preceding empty element in Python?

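I have inherited some XML that I need to process in Python. I am using xml.etree.cElementTree, and I am having trouble associating the text that follows an empty element with that empty element's tag. The real XML is considerably more complex than what I have pasted below, but I have simplified it to make the problem clearer (I hope!).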

I would like the result to be a dictionary like this:

Desired result

{(9, 1): 'As they say, A student has usually three maladies:', (9, 2): 'poverty, itch, and pride.'}

The tuples could also contain strings (e.g., ('9', '1')). I really don't care at this early stage.

Here is the XML:

test1.xml

<div1 type="chapter" num="9"> 
    <p> 
    <section num="1"/> <!-- The empty element --> 
     As they say, A student has usually three maladies: <!-- Here lies the trouble --> 
    <section num="2"/> <!-- Another empty element --> 
     poverty, itch, and pride. 
    </p> 
</div1>

What I have tried

Attempt 1

>>> import xml.etree.cElementTree as ET 
>>> tree = ET.parse('test1.xml') 
>>> root = tree.getroot() 
>>> chapter = root.attrib['num'] 
>>> d = dict() 
>>> for p in root:
...     for section in p:
...         d[(int(chapter), int(section.attrib['num']))] = section.text
...

>>> d 
{(9, 2): None, (9, 1): None} # This of course makes sense, since the elements are empty

Attempt 2

>>> for p in root:
...     for section, text in zip(p, p.itertext()):  # unfortunately, p and p.itertext() are two different lengths, which also makes sense
...         d[(int(chapter), int(section.attrib['num']))] = text.strip()
...

>>> d 
{(9, 2): 'As they say, A student has usually three maladies:', (9, 1): ''}

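As you can see in the second attempt, p and p.itertext() are two different lengths. The value stored under (9, 2) is the one I am trying to associate with the key (9, 1), and the value I want to associate with (9, 2) does not even show up in d (because zip truncates the longer p.itertext()).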
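The length mismatch can be reproduced with a minimal sketch. It assumes the XML is held in a string (here named `XML`, with the comments left out for brevity) rather than read from test1.xml. Iterating over `p` yields only the child elements, while `p.itertext()` also yields the whitespace between `<p>` and the first `<section>`, so every pairing made by `zip` is off by one.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for test1.xml, without the comments.
XML = """<div1 type="chapter" num="9">
    <p>
    <section num="1"/>
     As they say, A student has usually three maladies:
    <section num="2"/>
     poverty, itch, and pride.
    </p>
</div1>"""

p = ET.fromstring(XML).find("p")

children = list(p)           # the two <section> elements
texts = list(p.itertext())   # p.text, then the text following each section

print(len(children), len(texts))
for chunk in texts:
    # The first chunk is only whitespace: it is p.text, not section text.
    print(repr(chunk.strip()))
```

The first text chunk belongs to `p` itself, which is why the key (9, 1) ends up with an empty string in Attempt 2.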

Any help would be greatly appreciated. Thanks in advance.

Source

2013-12-21 user3079064

Have you tried using .tail?

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

txt = """<div1 type="chapter" num="9"> 
     <p> 
      <section num="1"/> <!-- The empty element --> 
      As they say, A student has usually three maladies: <!-- Here lies the trouble --> 
      <section num="2"/> <!-- Another empty element --> 
      poverty, itch, and pride. 
     </p> 
     </div1>""" 
root = ET.fromstring(txt) 
for p in root:
    for s in p:
        # .tail holds the text that follows an element's closing tag
        print(s.attrib['num'], s.tail)
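To get from `.tail` to the dictionary the question asks for, something like the following sketch might work. The helper name `sections_by_number` and the `XML` string are assumptions made for illustration, and the sketch assumes every `<section>` sits inside a single chapter element that carries the `num` attribute.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for test1.xml, without the comments.
XML = """<div1 type="chapter" num="9">
    <p>
    <section num="1"/>
     As they say, A student has usually three maladies:
    <section num="2"/>
     poverty, itch, and pride.
    </p>
</div1>"""


def sections_by_number(xml_text):
    """Hypothetical helper: map (chapter, section) to the text after each section."""
    root = ET.fromstring(xml_text)
    chapter = int(root.attrib["num"])
    result = {}
    for section in root.iter("section"):
        # .tail is None when nothing follows the element, so fall back to "".
        tail = (section.tail or "").strip()
        result[(chapter, int(section.attrib["num"]))] = tail
    return result


d = sections_by_number(XML)
print(d)
```

Stripping the tail removes the indentation and newlines that surround the text in the file.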

Source

2013-12-21 21:48:45 ChrisP

Brilliant. Works like a charm. Thanks. – user3079064

I would use BeautifulSoup for this:

from bs4 import BeautifulSoup 

html_doc = """<div1 type="chapter" num="9"> 
    <p> 
    <section num="1"/> 
     As they say, A student has usually three maladies: 
    <section num="2"/> 
     poverty, itch, and pride. 
    </p> 
</div1>""" 

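# Name the parser explicitly; with html.parser, <section .../> is treated as an empty element
soup = BeautifulSoup(html_doc, 'html.parser')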

result = {} 
for chapter in soup.find_all(type='chapter'):
    for section in chapter.find_all('section'):
        result[(chapter['num'], section['num'])] = section.next_sibling.strip()

import pprint 
pprint.pprint(result)

This prints:

{(u'9', u'1'): u'As they say, A student has usually three maladies:', 
(u'9', u'2'): u'poverty, itch, and pride.'}

Source

2013-12-21 21:59:39 jterrace
