2016-01-01 73 views

Printing the content between two XML tags in Python?

I am using ElementTree. For example, given this XML:

<TEXT> 
<PHRASE> 
<CONJ>and</CONJ> 
<V>came</V> 
<en x='PERS'>Adam</en> 
<PREP>from</PREP> 
<en x='LOC'>Atlanta</en> 
</PHRASE> 
<PHRASE> 
<en x='ORG'>Alpha</en> 
<ADJ y='1'>Amazingly</ADJ> 
<N>created by</N> 
<en x='PERS'>John</en> 
</PHRASE> 
</TEXT> 

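I want to print the whole phrase when one en tag has x='ORG' with the text "Alpha" and another en tag has x='PERS' with the text "John". The output I want is "Alpha Amazingly created by John".

I know how to find Alpha and John, but my problem is printing what is in between:
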
for phrase in root.findall('./PHRASE'):
    ens = {en.get('x'): en.text for en in phrase.findall('en')}
    if 'ORG' in ens and 'PERS' in ens:
        if ens["ORG"] == "Alpha" and ens["PERS"] == "John":
            print("ORG is: {}, PERS is: {} /".format(ens["ORG"], ens["PERS"]))

But how do I print the text of the remaining tags in that phrase?

[This may be relevant](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454), or try looking at [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) – javanut13

Answers

import xml.etree.ElementTree as ET 

xml = ''' 
<TEXT> 
<PHRASE> 
<CONJ>and</CONJ> 
<V>came</V> 
<en x='PERS'>Adam</en> 
<PREP>from</PREP> 
<en x='LOC'>Atlanta</en> 
</PHRASE> 
<PHRASE> 
<en x='ORG'>Alpha</en> 
<ADJ y='1'>Amazingly</ADJ> 
<N>created by</N> 
<en x='PERS'>John</en> 
</PHRASE> 
</TEXT> 
''' 

def section(seq, start, end):
    # Yield the items of seq from start through end, inclusive.
    returning = False
    for item in seq:
        returning |= item == start
        if returning:
            yield item
        returning &= item != end

root = ET.fromstring(xml) 
for phrase in root.findall('./PHRASE'):
    # Map each entity type (the x attribute) to its <en> element.
    ens = {en.get('x'): en for en in phrase.findall('en')}
    if 'ORG' in ens and 'PERS' in ens:
        if ens["ORG"].text == "Alpha" and ens["PERS"].text == "John":
            print("ORG is: {}, PERS is: {} /".format(ens["ORG"].text, ens["PERS"].text))
            # Join the text of every child from the ORG tag through the PERS tag.
            print(' '.join(el.text for el in section(phrase, ens["ORG"], ens["PERS"])))
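
To see what the section generator does on its own, here is a minimal sketch that runs it on a plain list instead of XML elements. The word list is made up for illustration. Note that if end never shows up after start, the generator keeps yielding until the sequence runs out.

```python
def section(seq, start, end):
    # Yield the items of seq from start through end, inclusive.
    returning = False
    for item in seq:
        returning |= item == start
        if returning:
            yield item
        returning &= item != end


# Illustrative data only; in the answer above the items are Element objects.
words = ["and", "Alpha", "Amazingly", "created by", "John", "again"]

between = list(section(words, "Alpha", "John"))
print(between)

# When end comes before start, everything from start onward is yielded.
tail = list(section(words, "John", "Alpha"))
print(tail)
```

With Element objects the == comparison falls back to identity, so passing the exact elements found in the phrase, as the answer does, selects the right start and end.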

It's simple:

import xml.etree.ElementTree as ET 

data = """<TEXT> 
    <PHRASE> 
     <CONJ>and</CONJ> 
     <V>came</V> 
     <en x='PERS'>Adam</en> 
     <PREP>from</PREP> 
     <en x='LOC'>Atlanta</en> 
    </PHRASE> 
    <PHRASE> 
     <en x='ORG'>Alpha</en> 
     <ADJ y='1'>Amazingly</ADJ> 
     <N>created by</N> 
     <en x='PERS'>John</en> 
    </PHRASE> 
</TEXT>""" 

root = ET.fromstring(data) 

for node in root.findall('./PHRASE'):
    ens = [node.find('en[@x="ORG"]'), node.find('en[@x="PERS"]')]

    if all(i is not None for i in ens):
        if 'Alpha' in ens[0].text and 'John' in ens[1].text:
            print(" ".join(node.itertext()))
            # To drop the whitespace between tags (including newlines):
            # " ".join(t.strip() for t in node.itertext() if t.strip())
            break
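
One thing to watch with itertext(): it also yields the whitespace that sits between the child tags, so the joined string carries newlines and indentation. Below is a minimal sketch of a whitespace-free variant. The helper name phrase_text is hypothetical (it is not part of ElementTree), and it uses exact equality rather than the substring test above, which is an assumption about what the question wants.

```python
import xml.etree.ElementTree as ET

DATA = """<TEXT>
    <PHRASE>
        <CONJ>and</CONJ>
        <V>came</V>
        <en x='PERS'>Adam</en>
        <PREP>from</PREP>
        <en x='LOC'>Atlanta</en>
    </PHRASE>
    <PHRASE>
        <en x='ORG'>Alpha</en>
        <ADJ y='1'>Amazingly</ADJ>
        <N>created by</N>
        <en x='PERS'>John</en>
    </PHRASE>
</TEXT>"""


def phrase_text(root, org, pers):
    """Hypothetical helper: return the text of the first PHRASE whose
    ORG and PERS entities match, or None if no phrase matches."""
    for node in root.findall('./PHRASE'):
        org_el = node.find("en[@x='ORG']")
        pers_el = node.find("en[@x='PERS']")
        if org_el is None or pers_el is None:
            continue
        if org_el.text == org and pers_el.text == pers:
            # itertext() also yields the whitespace between tags,
            # so drop the pieces that are empty after stripping.
            parts = (t.strip() for t in node.itertext())
            return " ".join(p for p in parts if p)
    return None


print(phrase_text(ET.fromstring(DATA), "Alpha", "John"))
```

Returning the string instead of printing it inside the loop makes the result easier to reuse, and None signals that no phrase matched.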