2013-12-21

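I've inherited some XML that I need to process in Python. I'm using xml.etree.cElementTree, and I'm having trouble associating the text that follows an empty element with that empty element's tag. The real XML is considerably more complex than what I've pasted below, but I've simplified it to make the problem clearer (I hope!). How can I associate XML text with the preceding empty element in Python?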

I would like the result to be a dictionary like this:

Desired result

{(9, 1): 'As they say, A student has usually three maladies:', (9, 2): 'poverty, itch, and pride.'} 

The tuples could also contain strings (e.g., ('9', '1')). I really don't care at this early stage.

Here is the XML:

test1.xml

<div1 type="chapter" num="9"> 
    <p> 
    <section num="1"/> <!-- The empty element --> 
     As they say, A student has usually three maladies: <!-- Here lies the trouble --> 
    <section num="2"/> <!-- Another empty element --> 
     poverty, itch, and pride. 
    </p> 
</div1> 

Here is what I have tried.

Attempt 1

>>> import xml.etree.cElementTree as ET 
>>> tree = ET.parse('test1.xml') 
>>> root = tree.getroot() 
>>> chapter = root.attrib['num'] 
>>> d = dict() 
>>> for p in root: 
        for section in p: 
            d[(int(chapter), int(section.attrib['num']))] = section.text 


>>> d 
{(9, 2): None, (9, 1): None} # This of course makes sense, since the elements are empty 

Attempt 2

>>> for p in root: 
        # unfortunately, p and p.itertext() are two different lengths, which also makes sense 
        for section, text in zip(p, p.itertext()): 
            d[(int(chapter), int(section.attrib['num']))] = text.strip() 


>>> d 
{(9, 2): 'As they say, A student has usually three maladies:', (9, 1): ''} 

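As you can see in the second attempt, p and p.itertext() are two different lengths. The value stored under (9, 2) is the one I am trying to associate with the key (9, 1), and the value I want under (9, 2) doesn't even appear in d (because zip stops at the shorter sequence and drops the rest of p.itertext()).

Any help would be greatly appreciated. Thanks in advance.

Answers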

1

Have you tried using .tail? An element's .tail holds the text that follows its end tag, up to the next tag.

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

txt = """<div1 type="chapter" num="9"> 
     <p> 
      <section num="1"/> <!-- The empty element --> 
      As they say, A student has usually three maladies: <!-- Here lies the trouble --> 
      <section num="2"/> <!-- Another empty element --> 
      poverty, itch, and pride. 
     </p> 
     </div1>""" 
root = ET.fromstring(txt) 
for p in root: 
    for s in p: 
        # s.tail still carries the surrounding whitespace; call .strip() to clean it
        print(s.attrib['num'], s.tail) 
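
To get from the printed values to the dictionary the question asks for, something along these lines might work. This is only a sketch: the helper name sections_by_tail is made up for illustration, it assumes a single chapter element at the root, and it relies on the default parser skipping XML comments so that the text on either side of a comment ends up in the same .tail.

```python
import xml.etree.ElementTree as ET

XML = """<div1 type="chapter" num="9">
    <p>
    <section num="1"/> <!-- The empty element -->
     As they say, A student has usually three maladies: <!-- Here lies the trouble -->
    <section num="2"/> <!-- Another empty element -->
     poverty, itch, and pride.
    </p>
</div1>"""


def sections_by_tail(root):
    """Map (chapter, section) to the text following each empty <section/>.

    Hypothetical helper: assumes `root` is the chapter element itself.
    """
    chapter = int(root.attrib["num"])
    result = {}
    for section in root.iter("section"):
        # .tail is the text after this element's end tag; it can be None
        # when nothing follows, hence the fallback to an empty string.
        text = (section.tail or "").strip()
        result[(chapter, int(section.attrib["num"]))] = text
    return result


d = sections_by_tail(ET.fromstring(XML))
print(d)
```

Using root.iter("section") rather than two nested loops means the sections are found even if the real document nests them more deeply than this simplified sample does.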

Brilliant. Works like a charm. Thanks. – user3079064

0

I would use BeautifulSoup for this:

from bs4 import BeautifulSoup 

html_doc = """<div1 type="chapter" num="9"> 
    <p> 
    <section num="1"/> 
     As they say, A student has usually three maladies: 
    <section num="2"/> 
     poverty, itch, and pride. 
    </p> 
</div1>""" 

soup = BeautifulSoup(html_doc) 

result = {} 
for chapter in soup.find_all(type='chapter'): 
    for section in chapter.find_all('section'): 
     result[(chapter['num'], section['num'])] = section.next_sibling.strip() 

import pprint 
pprint.pprint(result) 

This prints:

{(u'9', u'1'): u'As they say, A student has usually three maladies:', 
(u'9', u'2'): u'poverty, itch, and pride.'} 