Extracting attributes with BeautifulSoup4

Sample input XML:

<subj code1="textA" code2="textB" code3="textC"> 
    <txt count="1"> 
     <txt id="123"> 
      This is my text. 
     </txt> 
    </txt> 
</subj>

I am trying to use BeautifulSoup to extract information from this XML into a CSV file. My desired output is:

code1,code2,code3,txt 
textA,textB,textC,This is my text.

I have been playing with this example code, which I found here. It works for extracting txt, but not for code1, code2 and code3 in the subj tag.

if __name__ == '__main__':
    with open('sample.csv', 'w') as fhandle:
        writer = csv.writer(fhandle)
        writer.writerow(('code1', 'code2', 'code3', 'text'))
        for subj in soup.find_all('subj'):
            for x in subj:
                writer.writerow((subj.code1.text,
                                 subj.code2.text,
                                 subj.code3.text,
                                 subj.txt.txt))

However, I cannot get it to also pick up the attributes of subj that I want to extract. Any suggestions?

Source

2017-06-30 owwoow14

code1, code2 and code3 are not text; they are attributes.

To access them, treat an element as a dictionary:

writer.writerow((subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)))

Demo:

In [1]: from bs4 import BeautifulSoup 

In [2]: data = """ 
    ...: <subj code1="textA" code2="textB" code3="textC"> 
    ...:  <txt count="1"> 
    ...:   <txt id="123"> 
    ...:    This is my text. 
    ...:   </txt> 
    ...:  </txt> 
    ...: </subj> 
    ...: """ 

In [3]: soup = BeautifulSoup(data, "xml") 
In [4]: for subj in soup('subj'): 
    ...:     print([subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)])
['textA', 'textB', 'textC', 'This is my text.']
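
Putting the pieces together, the CSV export from the question could be sketched roughly as follows. This is only an illustration, not the asker's original script: the helper names subj_rows and write_csv are made up for this example, and it uses the built-in "html.parser" so that it runs without lxml (pass "xml" instead, as in the demo above, if lxml is installed). It also drops the inner "for x in subj" loop, which would write one row per child node instead of one row per subj.

```python
import csv
import sys

from bs4 import BeautifulSoup

DATA = """
<subj code1="textA" code2="textB" code3="textC">
    <txt count="1">
        <txt id="123">
            This is my text.
        </txt>
    </txt>
</subj>
"""


def subj_rows(markup, parser="html.parser"):
    """Yield one (code1, code2, code3, text) tuple per <subj> element.

    Hypothetical helper for this example; the parser default is an
    assumption chosen so that no extra dependency is needed.
    """
    soup = BeautifulSoup(markup, parser)
    for subj in soup.find_all("subj"):
        # Attributes are read with dictionary-style access on the tag;
        # get_text(strip=True) collects the nested text without padding.
        yield (subj["code1"], subj["code2"], subj["code3"],
               subj.get_text(strip=True))


def write_csv(markup, fhandle):
    """Write a header row plus one row per <subj> to an open file object."""
    writer = csv.writer(fhandle)
    writer.writerow(("code1", "code2", "code3", "txt"))
    writer.writerows(subj_rows(markup))


if __name__ == "__main__":
    # Prints to stdout here; to write a file, pass
    # open("sample.csv", "w", newline="") instead.
    write_csv(DATA, sys.stdout)
```

Opening the output file with newline="" is what the csv module documentation recommends, since the writer emits its own line terminators.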

You can also use .get() to provide a default value if an attribute is missing:

subj.get('code1', 'Default value for code1')
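
As a small illustration of why the default matters, here is a sketch with a subj element that lacks code2. The sample markup and the helper name subj_row are invented for this example, and "html.parser" is again used only to avoid depending on lxml. Dictionary-style access (subj['code2']) would raise KeyError on this input, whereas .get() returns the fallback.

```python
from bs4 import BeautifulSoup

# Hypothetical input: code2 is deliberately missing.
DATA = """
<subj code1="textA" code3="textC">
    <txt>Only two codes here.</txt>
</subj>
"""


def subj_row(subj, default=""):
    """Build a CSV row, substituting `default` for any missing attribute."""
    codes = tuple(subj.get(name, default)
                  for name in ("code1", "code2", "code3"))
    return codes + (subj.get_text(strip=True),)


soup = BeautifulSoup(DATA, "html.parser")
row = subj_row(soup.find("subj"), default="N/A")
print(row)
```

Whether an empty string, a placeholder such as "N/A", or skipping the row is the right choice depends on how the CSV will be consumed.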

Source

2017-06-30 13:37:09 alecxe
