2015-10-19 61 views
<instance id="activate.v.bnc.00024693" docsrc="BNC"> 
<answer instance="activate.v.bnc.00024693" senseid="38201"/> 
<context> 
Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with . 
</context> 
</instance> 

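I want to extract all of the text inside <context>. This is what I have so far. stuff.text only prints the text before <head></head> (that is, "Do you know ... step on to"), but I don't know how to extract the second half, after </head> (that is, "it . Used correctly ... easily cope with"). How do I parse this particular XML?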

import os
import xml.etree.ElementTree as et

tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()

for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))   # list the attributes available on the element
        print(stuff.text)   # only prints the text before <head>

Answers

0

If using BeautifulSoup is an option, this is trivial:

import bs4 
xtxt = '''  <instance id="activate.v.bnc.00024693" docsrc="BNC"> 
    <answer instance="activate.v.bnc.00024693" senseid="38201"/> 
    <context> 
    Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with . 
    </context> 
    </instance>''' 
# name the parser explicitly so bs4 does not warn or pick one by itself
soup = bs4.BeautifulSoup(xtxt, 'html.parser')
print(soup.find('context').text)

This gives (line-wrapped here for readability):

Do you know what it is , and where I can get one ? We suspect you had 
seen the Terrex Autospade , which is made by Wolf Tools . It is quite 
a hefty spade , with bicycle - type handlebars and a sprung lever at the 
rear , which you step on to activate it . Used correctly , you should n't 
have to bend your back during general digging , although it wo n't lift 
out the soil and put in a barrow if you need to move it ! If gardening 
tends to give you backache , remember to take plenty of rest periods 
during the day , and never try to lift more than you can easily cope 
with . 

If you prefer to use ElementTree, you should use itertext() to collect all of the text, including the text that follows child elements:

import os
import xml.etree.ElementTree as et

tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()

for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))
        # itertext() yields the element's text, each child's text, and each child's tail
        print(''.join(stuff.itertext()))
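
The reason stuff.text stops at <head> is that ElementTree stores only the text up to the first child in .text; whatever follows a child element is kept in that child's .tail. A minimal self-contained sketch of this, using a shortened stand-in for the sample rather than the real train.xml (the helper name context_text is made up for illustration):

```python
import xml.etree.ElementTree as et

# Shortened stand-in for the sample instance (an assumption, not the real train.xml).
SAMPLE = (
    '<instance id="activate.v.bnc.00024693" docsrc="BNC">'
    '<answer instance="activate.v.bnc.00024693" senseid="38201"/>'
    '<context>which you step on to <head>activate</head> it .</context>'
    '</instance>'
)


def context_text(xml_string):
    """Hypothetical helper: return all text inside <context>, tags removed."""
    root = et.fromstring(xml_string)
    context = root.find('context')
    # itertext() walks .text, then each child's text and tail, in document order
    return ''.join(context.itertext())


root = et.fromstring(SAMPLE)
context = root.find('context')
head = context.find('head')

print(repr(context.text))    # the text before <head> only
print(repr(head.text))       # the text inside <head>
print(repr(head.tail))       # the text after </head>
print(context_text(SAMPLE))  # everything joined together
```

Joining context.text, head.text and head.tail by hand gives the same string, but itertext() keeps working when there are several nested children.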

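If you are sure that your XML file is well-formed, ElementTree is enough, and because it is part of the Python standard library you will not have any external dependencies. But if the XML might be malformed, BeautifulSoup is good at repairing small errors.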
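
To make the "well-formed" condition concrete: ElementTree does not repair broken input, it raises ParseError. The sketch below assumes two made-up strings and a hypothetical helper is_well_formed; it is only meant to show how one might check a file before deciding which library to use.

```python
import xml.etree.ElementTree as et


def is_well_formed(xml_string):
    """Hypothetical helper: True if ElementTree can parse the string."""
    try:
        et.fromstring(xml_string)
        return True
    except et.ParseError:
        return False


# Illustrative inputs (assumptions, not taken from train.xml)
GOOD = '<context>step on to <head>activate</head> it .</context>'
BAD = '<context>step on to <head>activate it .</context>'  # <head> is never closed

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```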

+0

Thank you very much :) – needhelp

0

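You can serialize the element. There are two options:

  • keep the inner <head></head> tags
  • return just the text, without any tags.

When serializing with tags, the outer <context></context> tags can be removed manually: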

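# convert the element to a string (encoding='unicode' returns str on Python 3)
xml_str = et.tostring(stuff, encoding='unicode').strip()
# remove the outer <context></context> tags by slicing; lstrip()/rstrip()
# strip sets of characters, not prefixes/suffixes, so they can eat into the text
print(xml_str[len('<context>'):-len('</context>')])
# read only the text, without any tags
print(et.tostring(stuff, encoding='unicode', method='text'))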
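
Both options can be tried without the data file. The following is a sketch for Python 3, assuming a shortened stand-in <context> element with no attributes (the slicing would need adjusting if the opening tag carried any):

```python
import xml.etree.ElementTree as et

# Shortened stand-in for the <context> element (an assumption for illustration).
context = et.fromstring(
    '<context>step on to <head>activate</head> it .</context>'
)

# Option 1: keep the inner <head></head> markup, drop only the outer tags.
serialized = et.tostring(context, encoding='unicode')
inner = serialized[len('<context>'):-len('</context>')]

# Option 2: text only, with every tag removed.
plain = et.tostring(context, encoding='unicode', method='text')

print(inner)
print(plain)
```

Slicing by length is used here instead of lstrip()/rstrip() because those methods remove any leading or trailing characters from the given set, which can cut into the text itself.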