2017-05-05 73 views
1

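Parsing an XML file with many children and grandchildren

I'm very new to Python. I have a very large XML file that I want to extract some data from. Here is an excerpt: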

<program> 
    <id>38e072a7-8fc9-4f9a-8eac-3957905c0002</id> 
    <programID>3853</programID> 
    <orchestra>New York Philharmonic</orchestra> 
    <season>1842-43</season> 
    <concertInfo> 
     <eventType>Subscription Season</eventType> 
     <Location>Manhattan, NY</Location> 
     <Venue>Apollo Rooms</Venue> 
     <Date>1842-12-07T05:00:00Z</Date> 
     <Time>8:00PM</Time> 
    </concertInfo> 
    <worksInfo> 
     <work ID="52446*"> 
      <composerName>Beethoven, Ludwig van</composerName> 
      <workTitle>SYMPHONY NO. 5 IN C MINOR, OP.67</workTitle> 
      <conductorName>Hill, Ureli Corelli</conductorName> 
     </work> 
     <work ID="8834*4"> 
      <composerName>Weber, Carl Maria Von</composerName> 
      <workTitle>OBERON</workTitle> 
      <movement>"Ozean, du Ungeheuer" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II</movement> 
      <conductorName>Timm, Henry C.</conductorName> 
      <soloists> 
       <soloist> 
        <soloistName>Otto, Antoinette</soloistName> 
        <soloistInstrument>Soprano</soloistInstrument> 
        <soloistRoles>S</soloistRoles> 
       </soloist> 
      </soloists> 
     </work> 
     <work ID="3642*"> 
      <composerName>Hummel, Johann</composerName> 
      <workTitle>QUINTET, PIANO, D MINOR, OP. 74</workTitle> 
      <soloists> 
       <soloist> 
        <soloistName>Scharfenberg, William</soloistName> 
        <soloistInstrument>Piano</soloistInstrument> 
        <soloistRoles>A</soloistRoles> 
       </soloist> 
       <soloist> 
        <soloistName>Hill, Ureli Corelli</soloistName> 
        <soloistInstrument>Violin</soloistInstrument> 
        <soloistRoles>A</soloistRoles> 
       </soloist> 
       <soloist> 
        <soloistName>Derwort, G. H.</soloistName> 
        <soloistInstrument>Viola</soloistInstrument> 
        <soloistRoles>A</soloistRoles> 
       </soloist> 
       <soloist> 
        <soloistName>Boucher, Alfred</soloistName> 
        <soloistInstrument>Cello</soloistInstrument> 
        <soloistRoles>A</soloistRoles> 
       </soloist> 
       <soloist> 
        <soloistName>Rosier, F. W.</soloistName> 
        <soloistInstrument>Contrabass</soloistInstrument> 
        <soloistRoles>A</soloistRoles> 
       </soloist> 
      </soloists> 
     </work> 
     <work ID="0*"> 
      <interval>Intermission</interval> 
     </work> 
     <work ID="8834*3"> 
      <composerName>Weber, Carl Maria Von</composerName> 
      <workTitle>OBERON</workTitle> 
      <movement>Overture</movement> 
      <conductorName>Etienne, Denis G.</conductorName> 
     </work> 
     <work ID="8835*1"> 
      <composerName>Rossini, Gioachino</composerName> 
      <workTitle>ARMIDA</workTitle> 
      <movement>Duet</movement> 
      <conductorName>Timm, Henry C.</conductorName> 
      <soloists> 
       <soloist> 
        <soloistName>Otto, Antoinette</soloistName> 
        <soloistInstrument>Soprano</soloistInstrument> 
        <soloistRoles>S</soloistRoles> 
       </soloist> 
       <soloist> 
        <soloistName>Horn, Charles Edward</soloistName> 
        <soloistInstrument>Tenor</soloistInstrument> 
        <soloistRoles>S</soloistRoles> 
       </soloist> 
      </soloists> 
     </work> 
     <work ID="8837*6"> 
      <composerName>Beethoven, Ludwig van</composerName> 
      <workTitle>FIDELIO, OP. 72</workTitle> 
      <movement>"In Des Lebens Fruhlingstagen...O spur ich nicht linde," Florestan (aria)</movement> 
      <conductorName>Timm, Henry C.</conductorName> 
      <soloists> 
       <soloist> 
        <soloistName>Horn, Charles Edward</soloistName> 
        <soloistInstrument>Tenor</soloistInstrument> 
        <soloistRoles>S</soloistRoles> 
       </soloist> 
      </soloists> 
     </work> 
     <work ID="8336*4"> 
      <composerName>Mozart, Wolfgang Amadeus</composerName> 
      <workTitle>ABDUCTION FROM THE SERAGLIO,THE, K.384</workTitle> 
      <movement>"Ach Ich liebte," Konstanze (aria)</movement> 
      <conductorName>Timm, Henry C.</conductorName> 
      <soloists> 
       <soloist> 
        <soloistName>Otto, Antoinette</soloistName> 
        <soloistInstrument>Soprano</soloistInstrument> 
        <soloistRoles>S</soloistRoles> 
       </soloist> 
      </soloists> 
     </work> 
     <work ID="5543*"> 
      <composerName>Kalliwoda, Johann W.</composerName> 
      <workTitle>OVERTURE NO. 1, D MINOR, OP. 38</workTitle> 
      <conductorName>Timm, Henry C.</conductorName> 
     </work> 
    </worksInfo> 
</program> 
<program> 

What I want to do is extract the following information: programID, orchestra, season, eventType, work ID, soloistName, soloistInstrument, soloistRoles

Here is the code I'm using:

import csv 
import xml.etree.cElementTree as ET 
tree = ET.iterparse('complete.xml.txt') 
#root = tree.getroot() 


for program in root.iter('program'): 
    ID = program.findtext('id') 
    programID = program.findtext('programID') 
    orchestra = program.findtext('orchestra') 
    season = program.findtext('season') 

    for concert in program.findall('concertInfo'): 
        event = concert.findtext('eventType') 

    for worksInfo in program.findall('worksInfo'): 
     for work in worksInfo.iter('work'): 
      workid = work.get('ID') 
      for soloists in work.iter('soloists'): 
       for soloist in soloists.iter('soloist'): 
        soloname = soloist.findtext('soloistName') 
        soloinstrument = soloist.findtext('soloistInstrument') 
        solorole = soloist.findtext('soloistRoles') 
        #print(soloname, soloinstrument, solorole) 
      #print(workid) 
    #print(event)    
#print(programID , " , " , orchestra , " , " , season) 
with open("nyphil.txt","a") as nyphil: 
    nyphilwriter = csv.writer(nyphil) 
    nyphilwriter.writerow([programID, orchestra, season, event, workid, soloname.encode('utf-8'), soloinstrument, solorole]) 
nyphil.close() 

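When I run this code, I only get the last soloistName and soloistInstrument. The result I have in mind is one row per soloist, with the program-level fields repeated on each row. So I would have something like this:

13918, New York Philharmonic, 1842-43, Subscription Season, 52446*, Otto, Antoinette, Soprano, S

13918, ..., 3642*, Scharfenberg, William, Piano, A

13918, ..., 3642*, Hill, Ureli Corelli, Violin, A

and so on, until the last work ID:

13918, ..., 8336*4, Otto, Antoinette, Soprano, S

What I get instead is only the last work:

13918, New York Philharmonic, 1842-43, Subscription Season, 8336*, Otto, Antoinette, Soprano, S

The file contains more than 15,000 programs like the example I posted. I want to parse all of them and extract the information mentioned above. I'm not entirely sure how to go about this. I have searched the internet for ways to do it, but nothing I've tried has worked!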

Answers

0

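Your problem here is that you've misunderstood how loops work. In particular, a loop variable takes each value only while you are inside the loop body; once the loop has finished, it holds just the last value: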

for x in range(10): 
    pass 

print(x) # prints 9 

versus

for x in range(10): 
    print(x) 

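These are two different things. You're doing the former: your writerow call sits after all the loops, so it runs once and sees only the last values. What you need to do is something like this: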

import csv
import xml.etree.ElementTree as ET

# Assumes the file is well-formed XML with a single root element.
root = ET.parse('complete.xml.txt').getroot()

# newline='' is the csv module's recommended setting on Python 3.
with open('nyphil.txt', 'w', newline='') as f:
    nyphilwriter = csv.writer(f)
    for program in root.iter('program'):
        id_ = program.findtext('id')
        program_id = program.findtext('programID')
        orchestra = program.findtext('orchestra')
        season = program.findtext('season')
        for concert in program.findall('concertInfo'):
            event = concert.findtext('eventType')
        for info in program.findall('worksInfo'):
            for work in info.iter('work'):
                work_id = work.get('ID')
                for soloists in work.iter('soloists'):
                    for soloist in soloists.iter('soloist'):
                        # Write one row per soloist, inside the innermost loop.
                        # Change this line to whatever you want to write out.
                        nyphilwriter.writerow([id_, program_id, orchestra, season, event, work_id, soloist.findtext('soloistName')])
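
To see the "write inside the innermost loop" idea in isolation, here is a minimal, self-contained sketch that also emits the instrument and role columns asked for in the question. It is only an illustration under some assumptions: the `<programs>` root element and the trimmed sample data are made up for the demo, `soloist_rows` is a hypothetical helper name, and an in-memory `io.StringIO` buffer stands in for the real output file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample; the <programs> root is an assumption.
SAMPLE = """<programs>
  <program>
    <programID>3853</programID>
    <orchestra>New York Philharmonic</orchestra>
    <season>1842-43</season>
    <concertInfo><eventType>Subscription Season</eventType></concertInfo>
    <worksInfo>
      <work ID="52446*">
        <composerName>Beethoven, Ludwig van</composerName>
      </work>
      <work ID="8835*1">
        <soloists>
          <soloist>
            <soloistName>Otto, Antoinette</soloistName>
            <soloistInstrument>Soprano</soloistInstrument>
            <soloistRoles>S</soloistRoles>
          </soloist>
          <soloist>
            <soloistName>Horn, Charles Edward</soloistName>
            <soloistInstrument>Tenor</soloistInstrument>
            <soloistRoles>S</soloistRoles>
          </soloist>
        </soloists>
      </work>
    </worksInfo>
  </program>
</programs>"""


def soloist_rows(root):
    """Yield one row per soloist, repeating the program-level fields."""
    for program in root.iter('program'):
        program_fields = [
            program.findtext('programID'),
            program.findtext('orchestra'),
            program.findtext('season'),
            program.findtext('concertInfo/eventType'),
        ]
        for work in program.findall('worksInfo/work'):
            # Works without soloists simply produce no rows.
            for soloist in work.findall('soloists/soloist'):
                yield program_fields + [
                    work.get('ID'),
                    soloist.findtext('soloistName'),
                    soloist.findtext('soloistInstrument'),
                    soloist.findtext('soloistRoles'),
                ]


rows = list(soloist_rows(ET.fromstring(SAMPLE)))

buffer = io.StringIO()  # stands in for the output file
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Because names such as "Otto, Antoinette" contain commas, `csv.writer` quotes them automatically, which keeps the columns aligned when the file is read back. With a real file, replacing the buffer with `open('nyphil.txt', 'w', newline='')` should be the only change needed.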
+0

Thank you so much! This is exactly what I needed. I'm new to all of this, and the way loops work had me really confused. This is a huge help, thank you!

+0

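If this answer is the one that best solved your problem, you should mark it as accepted using the check mark on the left '<-----'. If you found it (or any other answer) especially useful, you can also upvote it with the triangle above the number.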

+0

Hi Wayne, I upvoted it, but I have less than 15 reputation, so it wasn't recorded :/ But your answer was very helpful!

0

13918 doesn't appear in your data. That aside, here is what I wrote; it seems to process your data successfully.

from lxml import etree 

tree = etree.parse('test.xml') 
programs = tree.xpath('.//program') 

for program in programs: 
    programID, orchestra, season = [program.xpath(tag)[0].text for tag in ['programID', 'orchestra', 'season']] 
    print(programID, orchestra, season) 
    works = program.xpath('worksInfo/work') 
    for work in works: 
        workID = work.attrib['ID'] 
        soloistItems = work.xpath('soloists/soloist') 
        for soloistItem in soloistItems: 
            print(workID, soloistItem.find('soloistName').text, soloistItem.find('soloistInstrument').text, soloistItem.find('soloistRoles').text) 

The script produces the following output.

3853 New York Philharmonic 1842-43 
8834*4 Otto, Antoinette Soprano S 
3642* Scharfenberg, William Piano A 
3642* Hill, Ureli Corelli Violin A 
3642* Derwort, G. H. Viola A 
3642* Boucher, Alfred Cello A 
3642* Rosier, F. W. Contrabass A 
8835*1 Otto, Antoinette Soprano S 
8835*1 Horn, Charles Edward Tenor S 
8837*6 Horn, Charles Edward Tenor S 
8336*4 Otto, Antoinette Soprano S 

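One thing to note: I wrapped your XML in a single root tag at the start and end, because the real data will contain multiple <program> elements.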
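
If the real file turns out to have no single root element, one possible way to apply that wrapping in code rather than by editing the file is sketched below. This is an assumption-laden illustration: `parse_rootless` is a hypothetical helper, the `programs` tag name is arbitrary, the second programID is invented for the demo, and the approach would not work as written if the text begins with an XML declaration (`<?xml ...?>`).

```python
import xml.etree.ElementTree as ET

# Hypothetical rootless input: two sibling <program> elements, no wrapper.
# The second programID is made up for illustration.
FRAGMENT = """<program><programID>3853</programID></program>
<program><programID>5178</programID></program>"""


def parse_rootless(text, root_tag='programs'):
    """Wrap sibling elements in a made-up root tag so the parser accepts them."""
    return ET.fromstring('<{0}>{1}</{0}>'.format(root_tag, text))


root = parse_rootless(FRAGMENT)
program_ids = [p.findtext('programID') for p in root.findall('program')]
print(program_ids)
```

This reads the whole text into memory before parsing, which is probably acceptable for around 15,000 programs but may not suit much larger files.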