Parsing this particular XML

<instance id="activate.v.bnc.00024693" docsrc="BNC"> 
<answer instance="activate.v.bnc.00024693" senseid="38201"/> 
<context> 
Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with . 
</context> 
</instance>

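I want to extract all of the text inside it. This is what I have so far. stuff.text only prints the text before <head></head> (i.e. "Do you know ... step on to"), but I don't know how to extract the second half, after </head> (i.e. "it . Used correctly ... easily cope with .").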

import os
import xml.etree.ElementTree as et

tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()

for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))
        # only the text before the first child element (<head>)
        print(stuff.text)


2015-10-19 needhelp

If using BeautifulSoup is an option, this becomes trivial:

import bs4 
xtxt = '''  <instance id="activate.v.bnc.00024693" docsrc="BNC"> 
    <answer instance="activate.v.bnc.00024693" senseid="38201"/> 
    <context> 
    Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with . 
    </context> 
    </instance>''' 
# name the parser explicitly; 'html.parser' ships with Python
soup = bs4.BeautifulSoup(xtxt, 'html.parser')
print(soup.find('context').text)

This gives:

Do you know what it is , and where I can get one ? We suspect you had 
seen the Terrex Autospade , which is made by Wolf Tools . It is quite 
a hefty spade , with bicycle - type handlebars and a sprung lever at the 
rear , which you step on to activate it . Used correctly , you should n't 
have to bend your back during general digging , although it wo n't lift 
out the soil and put in a barrow if you need to move it ! If gardening 
tends to give you backache , remember to take plenty of rest periods 
during the day , and never try to lift more than you can easily cope 
with .

If you prefer to use ElementTree, you should use itertext to process all of the text:

import os
import xml.etree.ElementTree as et

tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()

for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))
        # itertext() yields the element's text plus the text and tail of every descendant
        print(''.join(stuff.itertext()))
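The reason stuff.text stops at <head> is that ElementTree stores the text following a child element in that child's tail attribute, not on the parent. Here is a minimal sketch of this, assuming a shortened copy of the sample (the `sample` string and the variable names are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical copy of the <context> element from the question.
sample = "<context>which you step on to <head>activate</head> it . Used correctly</context>"

context = ET.fromstring(sample)
head = context.find("head")

before = context.text   # text before the first child element
inside = head.text      # text inside <head>
after = head.tail       # text after </head>, stored on the child

# itertext() walks the text and tail values in document order
full_text = "".join(context.itertext())
print(full_text)
```

Joining `before`, `inside` and `after` by hand gives the same string, which is why itertext is the convenient choice once there is more than one nested tag.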

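If you are confident that your XML file is well-formed, ElementTree is sufficient, and because it is part of the standard Python library you will not have any external dependencies. But if the XML may be malformed, BeautifulSoup is good at repairing minor errors.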
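One way to act on this is to try ElementTree first and only reach for a more lenient parser when parsing fails. The sketch below uses only the standard library. The helper name `context_text` and the two inline samples are assumptions for illustration, and the fallback itself is left to the caller:

```python
import xml.etree.ElementTree as ET

def context_text(xml_string):
    """Return the full text of <context>, or None if the XML is not well-formed."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        # Malformed input: the caller could retry with a lenient parser here.
        return None
    context = root.find("context")
    if context is None:
        return None
    return "".join(context.itertext()).strip()

# Hypothetical samples: the second one never closes <head>.
good = '<instance id="x"><context>step on to <head>activate</head> it .</context></instance>'
bad = '<instance id="x"><context>step on to <head>activate it .</context></instance>'

print(context_text(good))
print(context_text(bad))
```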


2015-10-19 14:57:15

Thank you very much :) – needhelp

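You can use element serialization. There are two options:

Keep the inner <head></head> tags.
Return only the text, without any tags.

When serializing with the tags, the outer <context></context> tags can be removed manually: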

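# convert the element to a string (encoding='unicode' returns str on Python 3)
xml_str = et.tostring(stuff, encoding='unicode').strip()
# slice off the outer <context></context> tags; lstrip/rstrip would strip
# a set of characters, not a prefix or suffix
print(xml_str[len('<context>'):-len('</context>')])
# read only the text, without any tags
print(et.tostring(stuff, method='text', encoding='unicode'))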
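To see the two options side by side, here is a small self-contained sketch for Python 3, where encoding="unicode" makes tostring return a str rather than bytes. The shortened `sample` string is an assumption for illustration:

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical copy of the <context> element from the question.
sample = "<context>step on to <head>activate</head> it .</context>"
context = ET.fromstring(sample)

# Option 1: keep the inner <head> tag, drop the outer <context> wrapper
serialized = ET.tostring(context, encoding="unicode")
inner_xml = serialized[len("<context>"):-len("</context>")]

# Option 2: text only, no tags
text_only = ET.tostring(context, method="text", encoding="unicode")

print(inner_xml)
print(text_only)
```

The wrapper is removed by slicing on the tag lengths because str.lstrip and str.rstrip take a set of characters to strip, not a prefix or suffix, so they can eat into the content when it starts or ends with one of those letters.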


2015-10-19 15:33:40
