2012-11-21

How do I select everything except a certain HTML element using beautifulsoup4?

Example:

import bs4 

html = ''' 
<div class="short-description std "> 
<em>Android Apps Security</em> provides guiding principles for how to 
best design and develop Android apps with security in mind. The book explores 
techniques that developers can use to build additional layers of security into 
their apps beyond the security controls provided by Android itself.    
<p class="scroll-down">∨ <a href="#main-desc" onclick="Effect.ScrollTo(
'main-desc', { duration:'0.2'}); return false;">Full Description</a> ∨</p></div> 
''' 
soup = bs4.BeautifulSoup(html, 'html.parser')  # name the parser explicitly

How do I get the following from soup (as a BeautifulSoup object)?

<div class="short-description std "> 
<em>Android Apps Security</em> provides guiding principles for how to 
best design and develop Android apps with security in mind. The book explores 
techniques that developers can use to build additional layers of security into 
their apps beyond the security controls provided by Android itself.    
</div> 

Answer


Simply search for it:

soup.find('p', class_='scroll-down') 

I restricted the search by class, but since there are no other p elements in the snippet, that is a little redundant here.
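As a minimal sketch of that redundancy, the snippet below uses a shortened, hypothetical version of the question's HTML and assumes the built-in 'html.parser'. With only one p element in the document, searching by tag name alone finds the same tag as searching with the class filter.

```python
import bs4

# Shortened stand-in for the question's HTML (hypothetical, for illustration only).
html = ('<div class="short-description std "><em>Title</em> text '
        '<p class="scroll-down"><a href="#main-desc">Full Description</a></p></div>')
soup = bs4.BeautifulSoup(html, 'html.parser')

by_class = soup.find('p', class_='scroll-down')  # tag name plus class filter
by_name = soup.find('p')                         # tag name only

# Both searches land on the same Tag object in the tree.
same_tag = by_class is by_name
```

The class filter starts to matter once the document contains other p elements that should be left alone.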

If you need to remove the tag instead, find it first with the method above, then call .extract() on it to remove it from the document:

>>> soup.find('p', class_='scroll-down').extract() 
<p class="scroll-down">∨ <a href="#main-desc" onclick="Effect.ScrollTo(
'main-desc', { duration:'0.2'}); return false;">Full Description</a> ∨</p> 
>>> print(soup)

<div class="short-description std "> 
<em>Android Apps Security</em> provides guiding principles for how to 
best design and develop Android apps with security in mind. The book explores 
techniques that developers can use to build additional layers of security into 
their apps beyond the security controls provided by Android itself.    
</div> 

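Two things to note: the removed tag is returned by the .extract() method, so you can save it for later use. The tag is also removed from the document completely, so if you still need it there you will have to re-add it manually later.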
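A rough sketch of both points, again on a shortened, hypothetical stand-in for the question's HTML and assuming 'html.parser': the extracted tag is kept in a variable, the document is serialized without it, and the tag is then re-attached with .append(). Re-adding it at the end of the div is only one possible choice.

```python
import bs4

# Shortened stand-in for the question's HTML (hypothetical, for illustration only).
html = '<div><em>Title</em> text <p class="scroll-down">more</p></div>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# extract() detaches the tag from the tree and returns it.
removed = soup.find('p', class_='scroll-down').extract()
without_p = str(soup)

# The saved tag can be put back manually, here as the last child of the div.
soup.div.append(removed)
restored = str(soup)
```

Note that soup.find() returns None when nothing matches, so real code would want to check the result before calling .extract() on it.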

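Alternatively, you can use the .decompose() method, which removes the tag from the document entirely without returning a reference to it. The tag is then gone for good.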
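For comparison, a small sketch of .decompose() on the same kind of shortened, hypothetical HTML (parser choice is again an assumption). The call returns nothing useful, and the p element can no longer be found in the document afterwards.

```python
import bs4

# Shortened stand-in for the question's HTML (hypothetical, for illustration only).
html = '<div><em>Title</em> text <p class="scroll-down">more</p></div>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# decompose() destroys the tag and its contents; it returns None.
result = soup.find('p', class_='scroll-down').decompose()

remaining = str(soup)          # the document without the p element
still_there = soup.find('p')   # None, because the tag is gone
```

If the removed tag might be needed again, .extract() is the safer choice, since .decompose() leaves nothing to re-attach.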


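Sorry Martijn, the question in the title was correct, but I got the question in the example wrong. I have edited it: it should be **everything except the 'p' element**, not **the 'p' element**. Your answer only gets the 'p' element. – Bentley4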


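@Bentley4: The question is still ambiguous about the result you expect and what you want to do with the remainder. It is also still incorrect: to filter *out* everything but the 'p' element is to select only the 'p' element. I suspect you want it the other way around. –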


I used *filter* as a synonym for *select*. So what I want is to select everything except the 'p' element. I have changed filter to select in the question. – Bentley4