Data extraction and re

2013-06-01

I am trying to extract specific information from JB Hi-Fi. Here is what I have done so far:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

url = "http://www.jbhifionline.com.au/support.aspx?post=1&results=10&source=all&bnSearch=Go!&q=ipod&submit=Go"

# Fetch the search results page and parse it
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

# Take the first table cell that holds a product title
Item0 = soup.findAll('td', {'class': 'check_title'})[0]
print (Item0.renderContents())

The output is:

Apple iPod Classic 160GB (Black)Â  
<span class="SKU">MC297ZP/A</span>

What I want is:

Apple iPod Classic 160GB (Black)

I tried to use re to strip out the other information:

print(Item0.renderContents()).replace{^<span:,""}

but it did not work.

So my question is: how can I strip the unwanted information and get "Apple iPod Classic 160GB (Black)"?


2013-06-01 Calvin Wu

Answer

Don't use .renderContents(); it is a debugging tool at best.

Just take the first child instead:

>>> Item0.contents[0] 
u'Apple iPod Classic 160GB (Black)\xc2\xa0\r\n\t\t\t\t\t\t\t\t\t\t\t' 
>>> Item0.contents[0].strip() 
u'Apple iPod Classic 160GB (Black)\xc2'
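
For comparison, the regular-expression route attempted in the question can also be made to work, although taking the first child is simpler. The following is only a minimal sketch: the `rendered` string is sample markup modelled on the output shown in the question (not fetched from the site), and `strip_span` is a hypothetical helper name. It assumes the text has already been decoded correctly, so the non-breaking space is a single character.

```python
import re

# Sample markup modelled on the output in the question (an assumption, not
# the live page); \xa0 is the non-breaking space that follows the title.
rendered = u'Apple iPod Classic 160GB (Black)\xa0 \n<span class="SKU">MC297ZP/A</span>'


def strip_span(html):
    """Remove every <span>...</span> element, then trim surrounding whitespace."""
    without_span = re.sub(r'<span\b[^>]*>.*?</span>', u'', html, flags=re.DOTALL)
    return without_span.strip()


title = strip_span(rendered)
print(title)
```

Regular expressions are fragile against nested or malformed markup, which is one reason to prefer the parsed tree.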

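The stray \xc2 left behind after .strip() suggests that BeautifulSoup has not guessed the encoding correctly, so the non-breaking space (U+00A0) is present as two separate characters instead of one. And indeed, BeautifulSoup guessed wrong: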

>>> soup.originalEncoding 
'iso-8859-1'
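
The effect of the wrong guess can be reproduced without the page at all. This is a small sketch using sample bytes (an assumption standing in for the server response): the UTF-8 encoding of U+00A0 is the two bytes C2 A0, and decoding them as ISO-8859-1 yields two characters, of which only the second counts as whitespace.

```python
# UTF-8 bytes as a server might send them (sample data, not fetched from the site).
raw = u'Apple iPod Classic 160GB (Black)\u00a0\r\n\t'.encode('utf-8')

as_latin1 = raw.decode('iso-8859-1')  # the wrong guess: C2 and A0 become two characters
as_utf8 = raw.decode('utf-8')         # the right charset: one non-breaking space

# strip() removes the trailing \xa0 in both cases, but \xc2 is not whitespace,
# so it survives in the Latin-1 version.
print(repr(as_latin1.strip()))
print(repr(as_utf8.strip()))
```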

You can force the encoding by using the response headers; this server does set the charset:

>>> page.info().getparam('charset') 
'utf-8' 
>>> page=urllib2.urlopen(url) 
>>> soup = BeautifulSoup(page.read(), fromEncoding=page.info().getparam('charset')) 
>>> Item0=soup.findAll('td',{'class':'check_title'})[0] 
>>> Item0.contents[0].strip() 
u'Apple iPod Classic 160GB (Black)'

The fromEncoding parameter tells BeautifulSoup to use UTF-8 instead of Latin-1, and now the non-breaking space is stripped correctly.
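
The getparam('charset') call above is part of the Python 2 urllib2 response API. As a rough sketch of the same header lookup using only the standard library's email.message.Message (the class that Python 3 HTTP response headers are built on), the charset can be read from a Content-Type value like this. The helper name `charset_from_content_type` and the sample header values are assumptions for illustration.

```python
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Return the lower-cased charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset(default)


# Sample header values (assumed, not taken from the live server).
declared = charset_from_content_type('text/html; charset=utf-8')
missing = charset_from_content_type('text/html')
print(declared, missing)
```

When the header carries no charset, the helper returns the default, and the parser is left to guess as before.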


2013-06-01 10:38:18
