2016-07-14 148 views
-1

I want to parse this simple document that I received from EPO-OPS. Why is XML parsing so difficult?

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?> 
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink"> 
    <ops:meta name="elapsed-time" value="2"/> 
    <exchange-documents> 
     <exchange-document system="ops.epo.org" family-id="19768124" country="EP" doc-number="1000000" kind="A1"> 
      <abstract lang="en"> 
       <p>The invention relates to an apparatus (1) for manufacturing green bricks from clay for the brick manufacturing industry, comprising a circulating conveyor (3) carrying mould containers combined to mould container parts (4), a reservoir (5) for clay arranged above the mould containers, means for carrying clay out of the reservoir (5) into the mould containers, means (9) for pressing and trimming clay in the mould containers, means (11) for supplying and placing take-off plates for the green bricks (13) and means for discharging green bricks released from the mould containers, characterized in that the apparatus further comprises means (22) for moving the mould container parts (4) filled with green bricks such that a protruding edge is formed on at least one side of the green bricks. &lt;IMAGE></p> 
      </abstract> 
     </exchange-document> 
    </exchange-documents> 
</ops:world-patent-data> 

This is what I do:

import xml.etree.ElementTree as ET

root = ET.parse('pyre.xml').getroot()
# Walk down four levels by hand: exchange-documents, exchange-document, abstract, p
for child in root:
    for kid in child:
        for abst in kid:
            for p in abst:
                print(p.text)

Is there a simple way to do this, similar to how JSON is accessed, such as:

print (root.exchange-documents.exchange-document.abstract.p.text) 

Answers

2

It is much easier with BeautifulSoup. Try this:

from bs4 import BeautifulSoup 

xml = """<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?> 
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink"> 
    <ops:meta name="elapsed-time" value="2"/> 
    <exchange-documents> 
     <exchange-document system="ops.epo.org" family-id="19768124" country="EP" doc-number="1000000" kind="A1"> 
      <abstract lang="en"> 
       <p>The invention relates to an apparatus (1) for manufacturing green bricks from clay for the brick manufacturing industry, comprising a circulating conveyor (3) carrying mould containers combined to mould container parts (4), a reservoir (5) for clay arranged above the mould containers, means for carrying clay out of the reservoir (5) into the mould containers, means (9) for pressing and trimming clay in the mould containers, means (11) for supplying and placing take-off plates for the green bricks (13) and means for discharging green bricks released from the mould containers, characterized in that the apparatus further comprises means (22) for moving the mould container parts (4) filled with green bricks such that a protruding edge is formed on at least one side of the green bricks. &lt;IMAGE></p> 
      </abstract> 
     </exchange-document> 
    </exchange-documents> 
</ops:world-patent-data>""" 

The "long" solution:

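# Name the parser explicitly to avoid the "no parser specified" warning;
# 'html.parser' ships with Python.
soup = BeautifulSoup(xml, 'html.parser')
for sub_cell_tag in soup.find_all('abstract'):
    print(sub_cell_tag.text)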

If you prefer a one-liner:

print('\n'.join(i.text for i in BeautifulSoup(xml, 'html.parser').find_all('abstract')))
+0

Is this the BeautifulSoup you mean: https://pypi.python.org/pypi/beautifulsoup4? – Rahul

+0

That is the one. – poke

+0

Yes, you can find its documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ –

2

You can use XPath expressions with ElementTree. Note that because the document declares a default XML namespace via xmlns, you need to specify its URI:

from xml.etree import ElementTree

tree = ElementTree.parse(…)

namespaces = { 'ns': 'http://www.epo.org/exchange' } 
paragraphs = tree.findall('.//ns:abstract/ns:p', namespaces) 
for paragraph in paragraphs: 
    print(paragraph.text) 
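The same namespace mapping also gets fairly close to the dotted access asked for in the question. Below is a minimal, self-contained sketch; the sample string is a shortened copy of the question's document, and the names `XML`, `NS`, and `abstract_text` are only illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question (abstract text trimmed).
XML = """<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
    <ops:meta name="elapsed-time" value="2"/>
    <exchange-documents>
        <exchange-document country="EP" doc-number="1000000" kind="A1">
            <abstract lang="en">
                <p>The invention relates to an apparatus (1) for manufacturing green bricks.</p>
            </abstract>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>"""

# 'ns' is an arbitrary local prefix; only the URI has to match the document.
NS = {"ns": "http://www.epo.org/exchange"}

root = ET.fromstring(XML)

# One explicit path from the root, mirroring
# root.exchange-documents.exchange-document.abstract.p from the question.
path = "ns:exchange-documents/ns:exchange-document/ns:abstract/ns:p"
p = root.find(path, NS)
abstract_text = p.text if p is not None else None
print(abstract_text)

# findtext() returns the text of the first match directly (None if nothing matches).
same_text = root.findtext(path, namespaces=NS)
```

`find` returns only the first match, so `findall` is still the better fit when a response contains several exchange documents.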
+0

Can't we get rid of the namespaces by using getroot()? – Rahul

+0

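No, ElementTree has namespaces built into its core and will (correctly) respect them. You can strip the namespaces after parsing, [as discussed in this answer](http://stackoverflow.com/a/25920989/216074), but there is no built-in solution for ignoring them. – poke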
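For illustration, here is a rough sketch of two ways to avoid spelling out the namespace. The first uses the `{*}` wildcard, which ElementTree's path syntax accepts from Python 3.8 onwards, so it postdates this thread. The second strips the prefixes after parsing; `strip_namespaces` is a hypothetical helper written for this example, not necessarily the approach in the linked answer, and it leaves attribute names untouched.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the document from the question.
XML = """<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
    <exchange-documents>
        <exchange-document country="EP" doc-number="1000000" kind="A1">
            <abstract lang="en">
                <p>The invention relates to an apparatus (1).</p>
            </abstract>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>"""


def strip_namespaces(root):
    """Hypothetical helper: drop the '{uri}' prefix from every tag, in place."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root


root = ET.fromstring(XML)

# Option 1 (Python 3.8+): '{*}' matches a tag in any namespace.
wildcard_texts = [p.text for p in root.findall(".//{*}abstract/{*}p")]

# Option 2: rewrite the tags once, then use plain names everywhere.
strip_namespaces(root)
stripped_texts = [p.text for p in root.findall(".//abstract/p")]

print(wildcard_texts)
print(stripped_texts)
```

Stripping is lossy: elements from different namespaces that share a local name become indistinguishable afterwards, so it suits read-only extraction better than round-tripping the document.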