Parsing an XML file with multiple root elements in Python

I have an XML file and need to extract some of its tags for later use. The data looks like this:

<?xml version="1.0"?> 
<data> 
    <country name="Liechtenstein"> 
     <rank>1</rank> 
     <year>2008</year> 
     <gdppc>141100</gdppc> 
     <neighbor name="Austria" direction="E"/> 
     <neighbor name="Switzerland" direction="W"/> 
    </country> 
    <country name="Singapore"> 
     <rank>4</rank> 
     <year>2011</year> 
     <gdppc>59900</gdppc> 
     <neighbor name="Malaysia" direction="N"/> 
    </country> 
    <country name="Panama"> 
     <rank>68</rank> 
     <year>2011</year> 
     <gdppc>13600</gdppc> 
     <neighbor name="Costa Rica" direction="W"/> 
     <neighbor name="Colombia" direction="E"/> 
    </country> 
</data> 
<?xml version="1.0"?> 
<data> 
    <country name="Liechtenstein1"> 
     <rank>1</rank> 
     <year>2008</year> 
     <gdppc>141100</gdppc> 
     <neighbor name="Austria1" direction="E"/> 
     <neighbor name="Switzerland1" direction="W"/> 
    </country> 
    <country name="Singapore"> 
     <rank>4</rank> 
     <year>2011</year> 
     <gdppc>59900</gdppc> 
     <neighbor name="Malaysia1" direction="N"/> 
    </country> 
    <country name="Panama"> 
     <rank>68</rank> 
     <year>2011</year> 
     <gdppc>13600</gdppc> 
     <neighbor name="Costa Rica" direction="W"/> 
     <neighbor name="Colombia" direction="E"/> 
    </country> 
</data>

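I need to parse this, so I used:

This code fails on line 2 with the error: xml.etree.ElementTree.ParseError: junk after document element:

I think this is because the file contains several XML documents, each with its own declaration and root element. Any idea how I should parse it?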


2017-08-03 ggupta

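"I have an XML file..." No, you don't. Where does the file come from? Is there any chance of fixing the problem at the source? (Parsing it shouldn't be too hard, but avoiding the invalid XML in the first place would be better.) – smarx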

That is not a valid XML file. But you can split it before each '<?xml version="1.0"?>' and then parse the parts separately. –
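The split suggested in this comment could be sketched roughly as follows. This is only an illustration, not the commenter's code: the helper name `parse_concatenated` and the inline sample string are made up here, and it assumes every document begins with an XML declaration and that the whole file fits in memory.

```python
import re
from xml.etree import ElementTree as ET

def parse_concatenated(text):
    """Split text at each XML declaration and parse the pieces separately.

    Hypothetical helper: assumes each document starts with '<?xml ...?>'.
    """
    # The declarations themselves are discarded; only the document
    # bodies between them are parsed.
    chunks = re.split(r'<\?xml\s[^>]*\?>', text)
    return [ET.fromstring(chunk) for chunk in chunks if chunk.strip()]

# Made-up sample in the same shape as the data in the question.
sample = (
    '<?xml version="1.0"?>\n<data><country name="Liechtenstein"/></data>\n'
    '<?xml version="1.0"?>\n<data><country name="Liechtenstein1"/></data>\n'
)
roots = parse_concatenated(sample)
print([root.find('country').get('name') for root in roots])
# prints ['Liechtenstein', 'Liechtenstein1']
```

Each element of `roots` is an ordinary `Element`, so the usual `find` and `iter` calls work on it.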

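@smarx What do you mean by "Is there any chance..."? I only posted sample data from the file; the real file contains many more root elements like these... @KlausD. I'm looking for a better option. – ggupta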

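In case you need it, this code fills in the details of one approach.

The code collects lines in `accumulated_xml` until it meets the start of another XML document or the end of the file. Once it has a complete XML document, it calls `display`, which uses the lxml library to parse the document and report some of its contents.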

>>> from lxml import etree
>>> def display(alist):
...     # Parse one complete document and report a few fields per country.
...     tree = etree.fromstring(''.join(alist))
...     for country in tree.xpath('.//country'):
...         print(country.attrib['name'], country.find('rank').text, country.find('year').text)
...         print([neighbour.attrib['name'] for neighbour in country.xpath('neighbor')])
...
>>> accumulated_xml = []
>>> with open('temp.xml') as temp:
...     while True:
...         line = temp.readline()
...         if line:
...             if line.startswith('<?xml'):
...                 # A new declaration starts: flush the previous document.
...                 if accumulated_xml:
...                     display(accumulated_xml)
...                     accumulated_xml = []
...             else:
...                 accumulated_xml.append(line.strip())
...         else:
...             # End of file: flush the last document.
...             display(accumulated_xml)
...             break
...
Liechtenstein 1 2008 
['Austria', 'Switzerland'] 
Singapore 4 2011 
['Malaysia'] 
Panama 68 2011 
['Costa Rica', 'Colombia'] 
Liechtenstein1 1 2008 
['Austria1', 'Switzerland1'] 
Singapore 4 2011 
['Malaysia1'] 
Panama 68 2011 
['Costa Rica', 'Colombia']


2017-08-03 18:22:30

Thanks for this. I had been using the same approach and was wondering whether there is a Python library for it. – ggupta

Whenever I split a file this way, I think there must be a better way to express it in Python. –
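One possible way to express the same splitting more compactly is a generator that yields one document at a time. The sketch below is an assumption-laden illustration rather than anything from the thread: `iter_documents` is a hypothetical name, and it assumes the input starts with a declaration and that every declaration sits at the start of its own line.

```python
from xml.etree import ElementTree as ET

def iter_documents(lines):
    """Yield the text of each document found in an iterable of lines.

    Hypothetical helper: assumes every declaration starts its own line.
    """
    current = []
    for line in lines:
        # A declaration marks the start of the next document.
        if line.startswith('<?xml') and current:
            yield ''.join(current)
            current = []
        current.append(line)
    # Emit whatever is left once the input is exhausted.
    if current:
        yield ''.join(current)

# Made-up sample lines; an open file object can be passed the same way.
lines = [
    '<?xml version="1.0"?>\n', '<data><rank>1</rank></data>\n',
    '<?xml version="1.0"?>\n', '<data><rank>4</rank></data>\n',
]
ranks = [ET.fromstring(doc).find('rank').text for doc in iter_documents(lines)]
print(ranks)
# prints ['1', '4']
```

Because the generator only holds one document in memory at a time, it should also suit files too large to read in one go.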

Question: ... any idea how I should parse this?

Go through the whole file and split it into valid <?xml ... chunks.
This creates myfile_01, myfile_02 ... myfile_nn.

n = 0
out_fh = None
with open('myfile.xml') as in_fh:
    for line in in_fh:
        # Each XML declaration starts a new output file.
        if line.startswith('<?xml'):
            if out_fh:
                out_fh.close()
            n += 1
            out_fh = open('myfile_{:02}'.format(n), 'w')

        # Skip anything that precedes the first declaration.
        if out_fh:
            out_fh.write(line)

if out_fh:
    out_fh.close()

If you want all the <country> elements in a single XML tree:

import re
from xml.etree import ElementTree as ET

with open('myfile.xml') as fh:
    # Collect every <country>...</country> block and wrap them in one <data> root.
    countries = re.findall('<country.*?</country>', fh.read(), re.S)
    root = ET.fromstring('<?xml version="1.0"?><data>{}</data>'.format(''.join(countries)))

Tested with Python 3.4.2
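The regex above is tied to the <country> tag. A variant that does not depend on any tag name might strip the declarations and wrap everything in one synthetic root instead. The sketch below is untested against the original file and is not part of the answer; `parse_under_one_root` and the `documents` tag are names invented for illustration, and it assumes the file contains no DOCTYPE or other prolog content besides the declarations.

```python
import re
from xml.etree import ElementTree as ET

def parse_under_one_root(text, root_tag='documents'):
    """Parse concatenated documents as children of one synthetic root.

    Hypothetical helper: 'documents' is an arbitrary wrapper tag.
    """
    # Remove every XML declaration, since only one is allowed per
    # document and it must come first.
    body = re.sub(r'<\?xml\s[^>]*\?>', '', text)
    return ET.fromstring('<{0}>{1}</{0}>'.format(root_tag, body))

# Made-up sample in the same shape as the data in the question.
sample = (
    '<?xml version="1.0"?>\n<data><country name="Liechtenstein"/></data>\n'
    '<?xml version="1.0"?>\n<data><country name="Singapore"/></data>\n'
)
root = parse_under_one_root(sample)
print(len(root.findall('data')))
# prints 2
print([country.get('name') for country in root.iter('country')])
# prints ['Liechtenstein', 'Singapore']
```

Each original <data> element stays intact as a child of the wrapper, so the grouping by source document is preserved.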


2017-08-03 20:33:53 stovfl

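Thanks for the suggestion; I used the same approach. – ggupta

I was only looking for a way to parse the file, not for any specific tag. The earlier answer was very helpful to me; thanks for the edit. – ggupta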
