Clean up and remove tags with BeautifulSoup

2010-06-30

I have the following script so far:

from mechanize import Browser 
from BeautifulSoup import BeautifulSoup 
import re 
import urllib2 

br = Browser() 
br.open("http://www.foo.com") 

html = br.response().read()

soup = BeautifulSoup(html) 
items = soup.findAll(id="info")

It runs perfectly and leaves the following in items:

<div id="info"> 
<span class="customer"><b>John Doe</b></span><br> 
123 Main Street<br> 
Phone:5551234<br> 
<b><span class="paid">YES</span></b> 
</div>

However, I'd like to take items and clean it up to get:

John Doe 
123 Main Street 
5551234

How can I remove tags like these with BeautifulSoup and Python?

As always, thanks!

2010-06-30 Parker

Answer

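This will do it for this EXACT html. Obviously it doesn't tolerate any deviation, so you'll need to add quite a bit of bounds checking and null checking, but here are the specifics of getting your data out as plain text.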

items = soup.findAll(id="info") 
print items[0].span.b.contents[0] 
print items[0].contents[3].strip() 
print items[0].contents[5].strip().split(":", 1)[1]
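
For anyone who wants the bounds and null checks mentioned above, here is a minimal sketch of a more tolerant version. It is not the code from the answer: it assumes Python 3 and the newer bs4 package (rather than the BeautifulSoup 3 import used in the question), and the function name extract_info, the HTML constant, and the rule "the first bare text line that is not a Phone: line is the address" are all made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question.
HTML = """<div id="info">
<span class="customer"><b>John Doe</b></span><br>
123 Main Street<br>
Phone:5551234<br>
<b><span class="paid">YES</span></b>
</div>"""


def extract_info(html):
    """Return (name, address, phone); any missing field comes back as None.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    info = soup.find(id="info")
    if info is None:
        # No element with id="info" at all.
        return None, None, None

    # Name: text inside the span with class "customer", if there is one.
    customer = info.find("span", class_="customer")
    name = customer.get_text(strip=True) if customer else None

    # Bare text nodes that are direct children of the div
    # (text nested inside tags, such as "YES", is skipped).
    lines = [
        str(child).strip()
        for child in info.children
        if isinstance(child, NavigableString) and child.strip()
    ]

    address = None
    phone = None
    for line in lines:
        if line.startswith("Phone:"):
            # Keep only what follows the first colon.
            phone = line.split(":", 1)[1].strip()
        elif address is None:
            # Assumption: the first other text line is the address.
            address = line
    return name, address, phone


name, address, phone = extract_info(HTML)
print(name)
print(address)
print(phone)
```

On the sample markup this prints John Doe, 123 Main Street and 5551234 on separate lines. Because it looks things up by class and by the Phone: prefix instead of by position in contents, it should survive extra whitespace or a missing line, but it is still tied to this general layout.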

2010-07-01 00:42:23

Thanks, Peter, this is exactly what I needed! – Parker 2010-07-01 11:37:03
