2013-12-13 88 views

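A similar question was asked here (Python XML Parsing), but I could not get to the content I am interested in. How can I parse the deeply nested structure of an XML string using Python?

I need to extract all the information enclosed in each patent-classification tag whose classification-scheme tag has the scheme value CPC. There are multiple such elements, and they are all contained within the patent-classifications tag.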

In the example given below, there are three such values: C 07 K 16 22 I, A 61 K 2039 505 A, and C 07 K 2317 92 A

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?> 
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink"> 
    <ops:meta name="elapsed-time" value="21"/> 
    <exchange-documents> 
     <exchange-document system="ops.epo.org" family-id="39103486" country="US" doc-number="2009234106" kind="A1"> 
      <bibliographic-data> 
       <publication-reference> 
        <document-id document-id-type="docdb"> 
         <country>US</country> 
         <doc-number>2009234106</doc-number> 
         <kind>A1</kind> 
         <date>20090917</date> 
        </document-id> 
        <document-id document-id-type="epodoc"> 
         <doc-number>US2009234106</doc-number> 
         <date>20090917</date> 
        </document-id> 
       </publication-reference> 
       <classifications-ipcr> 
        <classification-ipcr sequence="1"> 
         <text>C07K 16/ 44   A I     </text> 
        </classification-ipcr> 
       </classifications-ipcr> 
       <patent-classifications> 
        <patent-classification sequence="1"> 
         <classification-scheme office="" scheme="CPC"/> 
         <section>C</section> 
         <class>07</class> 
         <subclass>K</subclass> 
         <main-group>16</main-group> 
         <subgroup>22</subgroup> 
         <classification-value>I</classification-value> 
        </patent-classification> 
        <patent-classification sequence="2"> 
         <classification-scheme office="" scheme="CPC"/> 
         <section>A</section> 
         <class>61</class> 
         <subclass>K</subclass> 
         <main-group>2039</main-group> 
         <subgroup>505</subgroup> 
         <classification-value>A</classification-value> 
        </patent-classification> 
        <patent-classification sequence="7"> 
         <classification-scheme office="" scheme="CPC"/> 
         <section>C</section> 
         <class>07</class> 
         <subclass>K</subclass> 
         <main-group>2317</main-group> 
         <subgroup>92</subgroup> 
         <classification-value>A</classification-value> 
        </patent-classification> 
        <patent-classification sequence="1"> 
         <classification-scheme office="US" scheme="UC"/> 
         <classification-symbol>530/387.9</classification-symbol> 
        </patent-classification> 
       </patent-classifications> 
      </bibliographic-data> 
     </exchange-document> 
    </exchange-documents> 
</ops:world-patent-data> 

Answers

1

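You can use Python's standard xml module:

import xml.etree.ElementTree as ET

root = ET.parse('a.xml').getroot()

# Select each classification-scheme element with scheme="CPC", then step up
# to its parent patent-classification element.
for node in root.iterfind(".//{http://www.epo.org/exchange}classification-scheme[@scheme='CPC']/.."):
    # Keep the text of every child that has any (the scheme element itself is empty).
    data = [d.text for d in node if d.text]
    print(' '.join(data))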
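
Since the question mentions an XML string rather than a file, here is a minimal sketch of the same idea using `ET.fromstring` and a namespace prefix map. The `ex` prefix, the `cpc_classifications` helper name, and the trimmed sample document are illustrative assumptions, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the XML in the question (assumption: same default namespace).
XML = """<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
  <exchange-documents>
    <exchange-document country="US" doc-number="2009234106" kind="A1">
      <bibliographic-data>
        <patent-classifications>
          <patent-classification sequence="1">
            <classification-scheme office="" scheme="CPC"/>
            <section>C</section>
            <class>07</class>
            <subclass>K</subclass>
            <main-group>16</main-group>
            <subgroup>22</subgroup>
            <classification-value>I</classification-value>
          </patent-classification>
          <patent-classification sequence="1">
            <classification-scheme office="US" scheme="UC"/>
            <classification-symbol>530/387.9</classification-symbol>
          </patent-classification>
        </patent-classifications>
      </bibliographic-data>
    </exchange-document>
  </exchange-documents>
</ops:world-patent-data>"""

# Hypothetical prefix mapped to the document's default namespace.
NS = {"ex": "http://www.epo.org/exchange"}


def cpc_classifications(xml_text):
    """Return one space-joined string per patent-classification with scheme CPC."""
    root = ET.fromstring(xml_text)
    results = []
    for pc in root.iterfind(".//ex:patent-classification", NS):
        scheme = pc.find("ex:classification-scheme", NS)
        # Compare with None explicitly: an Element without children is falsy.
        if scheme is None or scheme.get("scheme") != "CPC":
            continue
        parts = [c.text.strip() for c in pc if c.text and c.text.strip()]
        results.append(" ".join(parts))
    return results


print(cpc_classifications(XML))
```

The prefix map keeps the path expressions short, and returning a list instead of printing makes the result easier to reuse.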
2

Install BeautifulSoup if you don't have it:

$ pip install beautifulsoup4

Then try this:

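from bs4 import BeautifulSoup

# Load the XML from the question, saved as example.xml
with open('example.xml', 'rb') as f:
    xml = f.read()
# Name the parser explicitly; the built-in html.parser lowercases tag names,
# which is harmless here because the tags are already lowercase
bs = BeautifulSoup(xml, 'html.parser')

# find patent-classification
patents = bs.find_all('patent-classification')
# filter the ones with CPC
for pa in patents:
    if pa.find('classification-scheme', {'scheme': 'CPC'}):
        print(pa.get_text())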
+1

Thanks, but is 'xml' being used as a variable? – user1140126

+0

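Well, the xml variable is where you load your XML. To try the exact code, create a file named 'example.xml' and put in it what you posted in the question. I have also edited my answer; I was missing a line. Thanks – PepperoniPizza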

+0

@user1140126 Check the answer again, I updated it. I was missing a line – PepperoniPizza