2017-09-24 51 views

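Problem parsing XML and getting the data into a pandas DataFrame

I am trying to import data from an XML file that contains respiratory data from an exercise test. The XML structure is as follows (simplified to show the general structure):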

<?xml version="1.0"?> 
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" 
    xmlns:o="urn:schemas-microsoft-com:office:office" 
    xmlns:x="urn:schemas-microsoft-com:office:excel" 
    xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" 
    xmlns:html="http://www.w3.org/TR/REC-html40"> 
    <Worksheet ss:Name="MetasoftStudio"> 
     <Table ss:ExpandedColumnCount="21" ss:ExpandedRowCount="458" x:FullColumns="1" x:FullRows="1" ss:StyleID="s62" ss:DefaultColumnWidth="53"> 
      <Column ss:StyleID="s62" ss:AutoFitWidth="0" ss:Width="137"/> 
      <Column ss:StyleID="s62" ss:AutoFitWidth="0" ss:Width="97"/> 
      <Column ss:StyleID="s62" ss:AutoFitWidth="0" ss:Width="137"/> 
      <Row ss:AutoFitHeight="0" ss:Height="26"> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">t</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">Phase</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">Marker</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">V'O2</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">V'O2/kg</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">V'O2/HR</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">HR</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">WR</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">V'E/V'O2</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">V'E/V'CO2</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">RER</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">V'E</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">BF</Data></Cell> 
      </Row> 
      <Row ss:Height="15"> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">h:mm:ss</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">L/min</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">ml/min/kg</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">ml</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">/min</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">W</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">L/min</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">/min</Data></Cell> 
      </Row> 
      <Row ss:Height="15"> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">0:00:06</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String">Rest</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="Number">0.27972413565454501</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="Number">4.3706896196022598</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="Number">4.5856415681072953</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="Number">61</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="Number">0</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="Number">27.002532271037801</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="Number">26.4113108545688</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="Number">1.0223851598932201</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="Number">10.155340000000001</Data></Cell> 
       <Cell ss:StyleID="Default"><Data ss:Type="Number">18.07</Data></Cell> 
      </Row> 
     </Table> 
    </Worksheet> 
</Workbook> 

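I used lxml to parse and traverse the XML file, extracting the 'Data' from each 'Cell' into a list and then appending that list to a parent list (which gives me a nested list with one inner list per row), using this code: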

from lxml import etree, objectify 
import pandas as pd 

with open('Python/cortex.xml') as infile:
    xml_file = infile.read()

root = objectify.fromstring(xml_file)

header = []
data = []

for row in root.Worksheet.Table.getchildren():
    temp_row = []
    if not row.tag == '{urn:schemas-microsoft-com:office:spreadsheet}Column':
        for cell in row.getchildren():
            temp_row.append(cell.Data)
        data.append(temp_row)
header = data.pop(0)  # remove the first 'row' and store it in the header list
del data[0]  # remove the 2nd line of superfluous data

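The first row gives the header, so I pop it into its own list. Row 2 contains the units for each variable, so I simply discard it. So far everything works fine (or so it seems)...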

Now I need to get this into a pandas DataFrame so I can start working with it. If I run df = pd.DataFrame(data, columns=header) and then print(df), I get: ValueError: Buffer has wrong number of dimensions (expected 1, got 32)

I am not sure what is going on there... If I build the df without assigning the header and print it, I get:

   0   1  2      3 \ 
0 [[[0:00:06]]] [[[Rest]]] [[[]]] [[[0.279724135654545]]] 
1 [[[0:00:09]]] [[[Rest]]] [[[]]] [[[0.465136232899829]]] 
2 [[[0:00:13]]] [[[Rest]]] [[[]]] [[[0.357975433456662]]] 
3 [[[0:00:19]]] [[[Rest]]] [[[]]] [[[0.543332419057909]]] 
4 [[[0:00:24]]] [[[Rest]]] [[[]]] [[[0.374604578743889]]] 

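That doesn't look right! Where do all these lists within lists come from? If I iterate over the nested list data and print it, it prints perfectly, but the problem appears as soon as I try to convert it to a df.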

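Can anyone enlighten me as to what is happening, and how to get the data into a pandas DataFrame? If there is a better approach than the one I am using, I am happy to drop mine.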

Answer


You can build a list of lists and then create the DataFrame with the constructor. The parsing is based on this solution:

from lxml import etree
import pandas as pd

with open('test.xml', 'r') as f:
    doc = etree.parse(f)

# prefix -> URI map used by the XPath expressions below
namespaces = {'o': 'urn:schemas-microsoft-com:office:office',
              'x': 'urn:schemas-microsoft-com:office:excel',
              'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}

L = []
ws = doc.xpath('/ss:Workbook/ss:Worksheet', namespaces=namespaces)
if len(ws) > 0:
    # take the first Table of the first Worksheet
    tables = ws[0].xpath('./ss:Table', namespaces=namespaces)
    if len(tables) > 0:
        rows = tables[0].xpath('./ss:Row', namespaces=namespaces)
        for row in rows:
            tmp = []
            cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)
            for cell in cells:
                # cell.text is a plain string (or None for an empty cell)
                tmp.append(cell.text)
            L.append(tmp)
print(L)

[['t', 'Phase', 'Marker', "V'O2", "V'O2/kg", "V'O2/HR", 'HR', 'WR', 
    "V'E/V'O2", "V'E/V'CO2", 'RER', "V'E", 'BF'], 
['h:mm:ss', None, None, 'L/min', 'ml/min/kg', 'ml', 
'/min', 'W', None, None, None, 'L/min', '/min'], 
['0:00:06', 'Rest', None, '0.27972413565454501', '4.3706896196022598', 
    '4.5856415681072953', '61', '0', '27.002532271037801', '26.4113108545688', 
    '1.0223851598932201', '10.155340000000001', '18.07']] 

df = pd.DataFrame(L[2:], columns=L[0]) 
print (df) 
     t Phase Marker     V'O2    V'O2/kg \ 
0 0:00:06 Rest None 0.27972413565454501 4.3706896196022598 

       V'O2/HR HR WR   V'E/V'O2   V'E/V'CO2 \ 
0 4.5856415681072953 61 0 27.002532271037801 26.4113108545688 

        RER     V'E  BF 
0 1.0223851598932201 10.155340000000001 18.07 
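
As for where the nested lists in the question come from: cell.Data in lxml.objectify is an element object rather than a plain string, and objectify elements support len() and indexing, which is most likely why pandas tries to unpack each one as a nested sequence. A minimal sketch of the original objectify approach that stores cell.Data.text instead is shown below. The inline XML is a cut-down sample invented for illustration, not the real file.

```python
from lxml import objectify
import pandas as pd

# Cut-down sample in the same SpreadsheetML layout (illustrative only).
xml = b'''<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name="MetasoftStudio">
  <Table>
   <Column ss:Width="137"/>
   <Row><Cell><Data ss:Type="String">t</Data></Cell><Cell><Data ss:Type="String">HR</Data></Cell></Row>
   <Row><Cell><Data ss:Type="String">h:mm:ss</Data></Cell><Cell><Data ss:Type="String">/min</Data></Cell></Row>
   <Row><Cell><Data ss:Type="String">0:00:06</Data></Cell><Cell><Data ss:Type="Number">61</Data></Cell></Row>
  </Table>
 </Worksheet>
</Workbook>'''

NS = '{urn:schemas-microsoft-com:office:spreadsheet}'
root = objectify.fromstring(xml)

rows = []
# Iterate only over Row children, so Column elements are skipped.
for row in root.Worksheet.Table.iterchildren(NS + 'Row'):
    # .text yields a plain str (or None), not an objectify element.
    rows.append([cell.Data.text for cell in row.iterchildren(NS + 'Cell')])

header, data = rows[0], rows[2:]  # rows[1] holds the units and is dropped
df = pd.DataFrame(data, columns=header)
print(df)
```

The values are still strings at this point, so numeric columns would need converting (for example with pd.to_numeric) before doing any arithmetic on them.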
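Works perfectly, thank you! I assume that if there were more than one Table element in the XML file, I would need to modify 'rows = tables[0].xpath' to use the index of the Table I want? – Braden

Give me a little time. But yes, that would be necessary. – jezrael

Sorry, I just realized that the larger file does not have multiple Table tags; instead, there are extra tags after the Table tag and before the data table starts. How would I ignore the rows that come before the data? If it helps, the full XML is here: [link](https://www.dropbox.com/s/plvlfj4avpodxfw/cortex_full.xml?dl=0). I want the rows starting from line 641 of the XML. – Braden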
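
One possible way to handle the last follow-up, offered as a sketch rather than a tested solution for the linked file: collect every row, then keep only the rows from the header row onward, recognized here by its first cell being 't'. The helper name rows_from_header, the preamble rows, and the second data row are invented for illustration. The standard library's xml.etree.ElementTree is used instead of lxml so the example has no extra dependencies; lxml's findall accepts the same path and namespace map.

```python
import xml.etree.ElementTree as ET
import pandas as pd

NS = {'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}

# Invented sample: two preamble rows, then header, units and data rows.
sample = '''<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name="MetasoftStudio">
  <Table>
   <Row><Cell><Data ss:Type="String">Some preamble</Data></Cell></Row>
   <Row><Cell><Data ss:Type="String">More preamble</Data></Cell></Row>
   <Row><Cell><Data ss:Type="String">t</Data></Cell><Cell><Data ss:Type="String">HR</Data></Cell></Row>
   <Row><Cell><Data ss:Type="String">h:mm:ss</Data></Cell><Cell><Data ss:Type="String">/min</Data></Cell></Row>
   <Row><Cell><Data ss:Type="String">0:00:06</Data></Cell><Cell><Data ss:Type="Number">61</Data></Cell></Row>
   <Row><Cell><Data ss:Type="String">0:00:09</Data></Cell><Cell><Data ss:Type="Number">63</Data></Cell></Row>
  </Table>
 </Worksheet>
</Workbook>'''


def rows_from_header(xml_text, first_header='t'):
    """Return all rows starting at the row whose first cell equals first_header."""
    root = ET.fromstring(xml_text)
    rows = [[d.text for d in row.findall('./ss:Cell/ss:Data', NS)]
            for row in root.findall('./ss:Worksheet/ss:Table/ss:Row', NS)]
    for i, cells in enumerate(rows):
        if cells and cells[0] == first_header:
            return rows[i:]
    raise ValueError('no header row starting with %r found' % first_header)


table = rows_from_header(sample)
# table[0] is the header, table[1] the units, the rest is data.
df = pd.DataFrame(table[2:], columns=table[0])
df['HR'] = pd.to_numeric(df['HR'])  # cell text is str, so convert explicitly
print(df)
```

Matching on the header content rather than on a fixed line number such as 641 should keep working if the length of the preamble differs between exports, assuming the first header cell is always 't'.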