2011-09-25 70 views
2

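XML: backtracking to the parent element

I am looking for a way to solve an XML-related problem in Python. Although spectrum is not the root element, let us assume it is for this example.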

<spectrum index="2" id="controller=0 scan=3" defaultArrayLength="485"> 
      <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/> 
      <cvParam cvRef="MS" accession="MS:1000580" name="MSn spectrum" value=""/> 
      <cvParam cvRef="MS" accession="MS:1000127" name="centroid mass spectrum" value=""/> 
      <precursorList count="1"> 
      <precursor spectrumRef="controller=0 scan=2"> 
       <isolationWindow> 
       <cvParam cvRef="MS" accession="MS:1000040" name="m/z" value="810.78999999999996"/> 
       <cvParam cvRef="MS" accession="MS:1000023" name="isolation width" value="2"/> 
       </isolationWindow> 
       <selectedIonList count="1"> 
       <selectedIon> 
        <cvParam cvRef="MS" accession="MS:1000040" name="m/z" value="810.78999999999996"/> 
       </selectedIon> 
       </selectedIonList> 
       <activation> 
       <cvParam cvRef="MS" accession="MS:1000133" name="collision-induced dissociation" value=""/> 
       <cvParam cvRef="MS" accession="MS:1000045" name="collision energy" value="35"/> 
       </activation> 
      </precursor> 
      </precursorList> 
      <binaryDataArrayList count="2"> 
      <binaryDataArray encodedLength="5176"> 
       <cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" value=""/> 
       <cvParam cvRef="MS" accession="MS:1000576" name="no compression" value=""/> 
       <cvParam cvRef="MS" accession="MS:1000514" name="m/z array" value="" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z"/> 
       <binary>AAAAYHHsbEAAAADg3yptQAAAAECt7G1AAAAAAN8JbkAAAAAA.......hLJ==</binary> 
      </binaryDataArray> 
      <binaryDataArray encodedLength="2588"> 
       <cvParam cvRef="MS" accession="MS:1000521" name="32-bit float" value=""/> 
       <cvParam cvRef="MS" accession="MS:1000576" name="no compression" value=""/> 
       <cvParam cvRef="MS" accession="MS:1000515" name="intensity array" value=""/> 
       <binary>ZFzUQWmVo0FH/o9BRfUyQg+xjUOzkZdC5k66QWk6HUSpqyZCsV1NQ......uH=</binary> 
      </binaryDataArray> 
      </binaryDataArrayList> 
</spectrum> 

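What I want to achieve is to find all selectedIon elements in the tree and trace each one back to its enclosing spectrum element. If a selectedIon element is found, print the following:

SelectedIon info:

Mass: 810.78999999999996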

Spectra Info: 
------------- 
index=2 
id=controller=0 
scan=3 
length=485 

General Info 
------------ 
ms level=2 
Msn spectrum= - 
centroid mass spectrum=- 
..................... 
And all the cvParam name and value as above. 

Binary 
------ 
m/z array = AAAAYHHsbEAAAADg3yptQAAAAECt7G1AAAA.....== 

intensity array = ZFzUQWmVo0FH/o9BRfUyQg+xjUOzkZdC5k66Q....5C77= 

What I have tried so far:

import xml.etree.ElementTree as ET 
tree=ET.parse('file.mzml') 
NS="{http://psi.hupo.org/ms/mzml}" 
filesource=tree.findall('.//'+NS+'selectedIon') # Will get all selectedIon element from the tree 

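Now, how do I backtrack from each selectedIon to the spectrum element and its child elements so that I can parse out the information shown above?

How can I accomplish this?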

+0

Why not go the other way? That is, loop over the spectrum elements and print one only if it has a selectedIon element. – Avaris

+0

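I want to parse only the selected spectrum elements. Going the other way would load every spectrum element, including ones that may not be selected. – thchand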

+0

Sure, but in that case you can skip that spectrum element and move on to the next one. – Avaris
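The approach suggested in these comments could look roughly like the sketch below, using only the standard library. The namespace URI comes from the question's own attempt; the `SAMPLE` document, the `ms` prefix and the helper name `spectra_with_selected_ion` are made up for illustration and are not part of any mzML tooling.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the question; the "ms" prefix is an arbitrary local alias.
NS = {"ms": "http://psi.hupo.org/ms/mzml"}

# Tiny stand-in document (hypothetical), not a complete mzML file.
SAMPLE = """<run xmlns="http://psi.hupo.org/ms/mzml">
  <spectrum index="1" id="controller=0 scan=2" defaultArrayLength="10"/>
  <spectrum index="2" id="controller=0 scan=3" defaultArrayLength="485">
    <precursorList count="1"><precursor>
      <selectedIonList count="1"><selectedIon>
        <cvParam name="m/z" value="810.78999999999996"/>
      </selectedIon></selectedIonList>
    </precursor></precursorList>
  </spectrum>
</run>"""


def spectra_with_selected_ion(root):
    """Yield only the spectrum elements that contain a selectedIon."""
    for spectrum in root.iterfind(".//ms:spectrum", NS):
        if spectrum.find(".//ms:selectedIon", NS) is None:
            continue  # no selected ion here, move on to the next spectrum
        yield spectrum


root = ET.fromstring(SAMPLE)
selected = list(spectra_with_selected_ion(root))
for spectrum in selected:
    print(spectrum.get("index"), spectrum.get("id"))
```

With a real file, `ET.parse('file.mzml').getroot()` would replace `ET.fromstring(SAMPLE)`. Every spectrum is still visited once, but only the matching ones are handed on for further processing.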

Answers

1

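XPath lets you access an ancestor: "ancestor::spectrum" will return the <spectrum> element that encloses the current node. If you use lxml, you can use the full XPath syntax to find the elements you need.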

from lxml import etree 
tree = etree.XML('file.mzml') 
NS = "{http://psi.hupo.org/ms/mzml}" 
filesource = tree.findall('.//'+NS+'selectedIon') 
spectrum = filesource.xpath('ancestor::spectrum')[0] 

(I think; not tested...)

Update: code that actually works:

from lxml import etree 

tree = etree.parse('foo.xml') 
for el in tree.findall(".//selectedIon"): 
    # every <spectrum> ancestor of this selectedIon element
    for top in el.xpath("ancestor::spectrum"): 
        print(top) 
+1

Hm, XPath is one-based: 'xpath('ancestor::spectrum')[1]'. You can also directly select every spectrum that contains a selectedIon: '//spectrum[.//selectedIon]' –

+0

I think filesource = tree.findall('.//'+NS+'selectedIon') creates a list, and a list has no attribute xpath. – thchand

+0

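@42: The //spectrum[.//selectedIon] expression will select all selectedIon. But the same question remains: how do I parse the information from the spectrum and its other child elements? – thchand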
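One possible way to get from each selectedIon back to its spectrum and collect the requested information, sketched with the standard library only. ElementTree elements keep no reference to their parent, so the sketch builds a child-to-parent map first. The `SAMPLE` document and the helpers `enclosing` and `array_name` are hypothetical; the sample omits the namespace, so with a real mzML file each tag would need the `{http://psi.hupo.org/ms/mzml}` prefix from the question.

```python
import xml.etree.ElementTree as ET

# Cut-down, namespace-free stand-in for the question's XML (hypothetical).
SAMPLE = """<run>
  <spectrum index="2" id="controller=0 scan=3" defaultArrayLength="485">
    <cvParam name="ms level" value="2"/>
    <cvParam name="MSn spectrum" value=""/>
    <precursorList count="1"><precursor>
      <selectedIonList count="1"><selectedIon>
        <cvParam name="m/z" value="810.78999999999996"/>
      </selectedIon></selectedIonList>
    </precursor></precursorList>
    <binaryDataArrayList count="1"><binaryDataArray>
      <cvParam name="64-bit float" value=""/>
      <cvParam name="m/z array" value=""/>
      <binary>AAAA</binary>
    </binaryDataArray></binaryDataArrayList>
  </spectrum>
</run>"""


def enclosing(element, tag, parent_of):
    """Walk up the parent map until an element named `tag` is found."""
    node = parent_of.get(element)
    while node is not None and node.tag != tag:
        node = parent_of.get(node)
    return node


def array_name(array):
    """Return the cvParam name that labels the array, e.g. 'm/z array'."""
    for param in array.findall("cvParam"):
        if param.get("name", "").endswith(" array"):
            return param.get("name")
    return None


root = ET.fromstring(SAMPLE)
# Record each element's parent once, since Element objects do not store it.
parent_of = {child: parent for parent in root.iter() for child in parent}

reports = []
for ion in root.iter("selectedIon"):
    spectrum = enclosing(ion, "spectrum", parent_of)
    if spectrum is None:
        continue
    reports.append({
        "mass": ion.find("cvParam").get("value"),
        "attributes": dict(spectrum.attrib),
        # findall without './/' matches direct children only,
        # i.e. the spectrum-level cvParam entries.
        "general": {p.get("name"): p.get("value")
                    for p in spectrum.findall("cvParam")},
        "binary": {array_name(a): a.findtext("binary")
                   for a in spectrum.iter("binaryDataArray")},
    })

for report in reports:
    print(report)
```

Each entry in `reports` holds the selected-ion mass, the spectrum attributes, the spectrum-level cvParam pairs and the still-encoded binary arrays, which can then be printed in whatever layout is wanted.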

0

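If this is still a current issue, you could try pymzML, a Python interface to mzML files.

Printing all the information for every MS2 spectrum is as easy as: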

import pymzml 
msrun = pymzml.run.Reader("your-file.mzML") 
for spectrum in msrun: 
    if spectrum['ms level'] == 2: 
     # spectrum is a dict, so you can just print it   
     print(spectrum) 

(Disclosure: I am one of the authors)