2017-06-30

Extracting attributes from XML tags with BeautifulSoup4

Sample input XML:

<subj code1="textA" code2="textB" code3="textC">
    <txt count="1">
        <txt id="123">
            This is my text.
        </txt>
    </txt>
</subj>

I am trying to extract information from this XML into a CSV file with BeautifulSoup. My desired output is:

code1,code2,code3,txt 
textA,textB,textC,This is my text. 

I have been playing with this sample code, which I found here. It works for extracting txt, but not for code1, code2, and code3 in the subj tag:

if __name__ == '__main__':
    with open('sample.csv', 'w') as fhandle:
        writer = csv.writer(fhandle)
        writer.writerow(('code1', 'code2', 'code3', 'text'))
        for subj in soup.find_all('subj'):
            for x in subj:
                writer.writerow((subj.code1.text,
                                 subj.code2.text,
                                 subj.code3.text,
                                 subj.txt.txt))

However, I cannot get it to pick up the attributes of subj that I want to extract. Any suggestions?

Answer

code1, code2, and code3 are not text; they are attributes of the subj element.

To access them, treat an element as a dictionary:

writer.writerow((subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)))

Demo:

In [1]: from bs4 import BeautifulSoup 

In [2]: data = """ 
    ...: <subj code1="textA" code2="textB" code3="textC"> 
    ...:  <txt count="1"> 
    ...:   <txt id="123"> 
    ...:    This is my text. 
    ...:   </txt> 
    ...:  </txt> 
    ...: </subj> 
    ...: """ 

In [3]: soup = BeautifulSoup(data, "xml") 
In [4]: for subj in soup('subj'): 
    ...:  print([subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)]) 
['textA', 'textB', 'textC', 'This is my text.'] 
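
Putting the pieces together for the CSV output the question asks for might look like the sketch below. The helper names `subj_rows` and `write_csv` are made up for illustration, and the sketch assumes every `subj` element carries all three attributes. It uses the built-in "html.parser" so that it runs without lxml; the "xml" parser from the demo should work the same way when lxml is installed.

```python
import csv
import io

from bs4 import BeautifulSoup

data = """
<subj code1="textA" code2="textB" code3="textC">
    <txt count="1">
        <txt id="123">
            This is my text.
        </txt>
    </txt>
</subj>
"""


def subj_rows(markup):
    """Yield one (code1, code2, code3, text) tuple per <subj> element."""
    # Assumption: "html.parser" is enough here because all tag and
    # attribute names in the sample are lowercase.
    soup = BeautifulSoup(markup, "html.parser")
    for subj in soup.find_all("subj"):
        # Attributes are read with dictionary-style access;
        # the nested text is collected with get_text().
        yield (subj["code1"], subj["code2"], subj["code3"],
               subj.get_text(strip=True))


def write_csv(markup, fhandle):
    """Write the header row and one row per <subj> to an open file object."""
    writer = csv.writer(fhandle)
    writer.writerow(("code1", "code2", "code3", "txt"))
    writer.writerows(subj_rows(markup))


# Write to an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
write_csv(data, buffer)
print(buffer.getvalue())
```

To produce sample.csv instead, pass a real file handle, for example one obtained from `open('sample.csv', 'w', newline='')`.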

You can also use .get() to provide a default value if an attribute is missing:

subj.get('code1', 'Default value for code1')
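
As a small sketch of the difference, the snippet below parses a made-up element that lacks code2. The markup and the "N/A" placeholder are assumptions chosen for illustration, and "html.parser" is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical input: code2 is deliberately missing.
markup = '<subj code1="textA" code3="textC"><txt>Some text.</txt></subj>'
soup = BeautifulSoup(markup, "html.parser")
subj = soup.find("subj")

# .get() falls back to the default instead of raising an error.
codes = [subj.get(name, "N/A") for name in ("code1", "code2", "code3")]
print(codes)  # ['textA', 'N/A', 'textC']

# Dictionary-style access raises KeyError for a missing attribute.
try:
    subj["code2"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)  # True
```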