2012-11-10 114 views
0

I have tried many things, but I cannot extract the contents of the <head> tags. Can anyone help?

Original XML: https://dl.dropbox.com/u/3482709/English_sense_induction.xml.zip

Here is the full text:

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE corpus SYSTEM "sense-induction.dtd"> 
<corpus lang="en"> 
    <lexelt item="explain.v"> 
    <instance id="explain.v.4" corpus="wsj"> 
For OPEC , that 's ideal . The resulting firm prices and stability `` will allow both producers and consumers to plan confidently , '' says Saudi Arabian Oil Minister Hisham Nazer . OPEC Secretary-General Subroto <head> explains </head> : Consumers offer security of markets , while OPEC provides security of supply . `` This is an opportune time to find mutual ways { to prevent } price shocks from happening again , '' he says . To promote this balance , OPEC now is finally confronting a long-simmering internal problem . 
</instance> 
    <instance id="explain.v.10" corpus="wsj"> 
and given the right conditions , sympathetic to creating some form of life . Surely at some other cosmic address a Gouldoid creature would have risen out of the ooze to <head> explain </head> why , paleontologically speaking , `` it is , indeed , a wonderful life . '' Mr. Holt is a columnist for the Literary Review in London . 
</instance> 
    <instance id="explain.v.76" corpus="wsj"> 
`` You ca n't build on your hit-and-miss five-seventeen '' . `` What are you playing '' ? ? Owen asked . `` I 'm just logging '' , the cowboy <head> explained </head> . `` I keep all these plays in this little black book , and I watch over a twelve-hour period to find out what numbers are repeating . But roulette 's not my game . 
</instance> 
    </lexelt> 
    <lexelt item="position.n"> 
    <instance id="position.n.288" corpus="wsj"> 
But not everybody was making money . The carnage on the Chicago Board Options Exchange , the nation 's major options market , was heavy after the trading in S&amp;P 100 stock-index options was halted Friday . Many market makers in the S&amp;P 100 index options contract had bullish <head> positions </head> Friday , 
</instance> 
    <instance id="position.n.123" corpus="wsj"> 
An explosion at the Microbiology and Virology Institute in Sverdlovsk released anthrax germs that caused a significant number of deaths . Since Mr. Shevardnadze did not address this topic before the Supreme Soviet , the Soviet Union 's official <head> position </head> remains that the anthrax deaths were caused by 
</instance> 
    </lexelt> 
</corpus> 

Edit

The problem was that I forgot to pass 'xml' as the second argument. The solution is soup = BeautifulSoup(xml_data, 'xml')

Answers

1
from bs4 import BeautifulSoup 

soup = BeautifulSoup(xml_data, 'xml') 
head_datas = [head.get_text() for head in soup.find_all('head')] 

print(head_datas)
# [' explains ', ' explain ', ' explained ', ' positions ', ' position ']

You can also use the .string attribute if <head> contains only a single child and that child is a string:

head_datas = [head.string for head in soup.find_all('head')] 
+0

You could add more value to this answer with a short explanation of what is going on, to help the original poster and future readers. –

+0

Maybe link to the documentation: ['.get_text()'](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text), ['.string'](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#string) – Nicolas

+0

It doesn't work. Could you try again on the updated version of the XML? – Thorn

1
>>> t = '''<?xml ...''' 
>>> from bs4 import BeautifulSoup 
>>> soup = BeautifulSoup(t, 'xml')  # name the XML parser explicitly; an HTML parser may treat <head> specially
>>> soup.find('head') 
<head> explains </head> 
>>> _.text 
' explains ' 

Since your input is valid XML, you could also use a different XML parser, such as ElementTree:

>>> from xml.etree import ElementTree 
>>> tree = ElementTree.fromstring(t) 
>>> tree.find('.//head') 
<Element 'head' at 0x00000000031226D8> 
>>> _.text 
' explains ' 
+0

It doesn't work for me. Maybe I'll add more of my XML file. – Thorn

+0

What doesn't work? Are you getting an error? What version of Python are you using? – poke

+0

I changed the XML a bit. Could you try again? Python 2.7, bs4 4.1.3 – Thorn