2014-03-06 77 views
5

How do I select a div by class inside another div with Beautiful Soup?

I have a bunch of div tags nested inside div tags:

<div class="foo"> 
    <div class="bar">I want this</div> 
    <div class="unwanted">Not this</div> 
</div> 
<div class="bar">Don't want this either 
</div> 

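I'm using Python and Beautiful Soup to pull pieces out of this markup. I need every div with class "bar", but only when it is wrapped inside a div with class "foo". Here is my code: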

from bs4 import BeautifulSoup 
soup = BeautifulSoup(open(r'C:\test.htm')) 
tag = soup.div 
for each_div in soup.findAll('div',{'class':'foo'}): 
    print(tag["bar"]).encode("utf-8") 

Alternatively, I tried:

from bs4 import BeautifulSoup 
soup = BeautifulSoup(open(r'C:\test.htm')) 
for each_div in soup.findAll('div',{'class':'foo'}): 
    print(each_div.findAll('div',{'class':'bar'})).encode("utf-8") 

What am I doing wrong? I would be just as happy with a simple print(each_div) if I could remove the div with class "unwanted" from the selection.

Answer

8

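You can use find_all() to search for the <div> elements whose class attribute is foo, and then call find() on each of them to get the <div> whose class attribute is bar, like: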

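from bs4 import BeautifulSoup
import sys

# Parse the HTML file named on the command line.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for foo in soup.find_all('div', attrs={'class': 'foo'}):
    # find() returns only the first matching descendant, or None if there is none.
    bar = foo.find('div', attrs={'class': 'bar'})
    if bar is not None:
        print(bar.text)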

Run it like:

python3 script.py htmlfile 

That yields:

I want this 

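UPDATE: If a foo div can contain several <div> elements with the bar class, the previous script will not work, because find() returns only the first match. Instead, you can iterate over each foo div's descendants and pick out the matching ones, like: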

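from bs4 import BeautifulSoup, Tag
import sys

# Parse the HTML file named on the command line.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for foo in soup.find_all('div', attrs={'class': 'foo'}):
    # descendants yields every nested node, text nodes as well as tags.
    for d in foo.descendants:
        # Skip text nodes; keep divs whose class list includes 'bar'.
        if isinstance(d, Tag) and d.name == 'div' and 'bar' in d.get('class', []):
            print(d.text)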

With input like this:

<div class="foo"> 
    <div class="bar">I want this</div> 
    <div class="unwanted">Not this</div> 
    <div class="bar">Also want this</div> 
</div> 

This produces:

I want this 
Also want this
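
As a possible shorter route to the same result, here is a minimal sketch that uses Beautiful Soup's select() with the CSS descendant selector "div.foo div.bar". It also shows one way to handle the question's other wish, printing the foo div with the "unwanted" div removed, by calling decompose(). The HTML string is assembled from the snippets in this thread, and the helper names bar_texts and foo_without_unwanted are made up for illustration. It assumes a bs4 version where select() is available (recent releases rely on the soupsieve package for this).

```python
from bs4 import BeautifulSoup

# Sample markup assembled from the snippets in this thread (an assumption,
# not a file from the original post).
HTML = """
<div class="foo">
    <div class="bar">I want this</div>
    <div class="unwanted">Not this</div>
    <div class="bar">Also want this</div>
</div>
<div class="bar">Don't want this either
</div>
"""


def bar_texts(html):
    """Return the text of every div.bar nested anywhere inside a div.foo."""
    soup = BeautifulSoup(html, "html.parser")
    # "div.foo div.bar" is a descendant selector, so the top-level div.bar
    # that is not wrapped in a div.foo is not matched.
    return [div.get_text(strip=True) for div in soup.select("div.foo div.bar")]


def foo_without_unwanted(html):
    """Return the text of the first div.foo after removing div.unwanted."""
    soup = BeautifulSoup(html, "html.parser")
    foo = soup.find("div", class_="foo")
    # decompose() removes a tag and its contents from the tree.
    for div in foo.find_all("div", class_="unwanted"):
        div.decompose()
    # Join the remaining text fragments with single spaces.
    return foo.get_text(" ", strip=True)


if __name__ == "__main__":
    for text in bar_texts(HTML):
        print(text)
    print(foo_without_unwanted(HTML))
```

If only direct children should match, the child selector "div.foo > div.bar" could be used in place of the descendant selector.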