Problem scraping data with BeautifulSoup

I have written the following trial code to retrieve the titles of legislative acts from the European Parliament.

import urllib2 
from BeautifulSoup import BeautifulSoup 

search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-%.4d&language=EN" 

for number in xrange(1,10): 
    url = search_url % number 
    page = urllib2.urlopen(url).read() 
    soup = BeautifulSoup(page) 
    title = soup.findAll("title") 
    print title

However, whenever I run it, I get the following error:

Traceback (most recent call last): 
    File "<stdin>", line 20, in <module> 
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 70: ordinal not in range(128)

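I have narrowed it down to BeautifulSoup not being able to read the fourth document in the loop. Can anyone explain to me what I am doing wrong?

With kind regards,

Thomas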

2010-07-01 Thomas Jensen

Replacing

print title

with

for t in title:
    print(t)

or

print('\n'.join(t.string for t in title))

works. I am not entirely sure why print <somelist> sometimes works and sometimes does not.
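One possible explanation, offered as a sketch rather than a confirmed diagnosis: printing a list formats each element with repr(), whereas printing an element directly uses its string conversion. The Title class below is a hypothetical stand-in for a parsed tag, not part of BeautifulSoup. Under Python 2, a __repr__ that returns non-ASCII unicode is encoded with the default ASCII codec, which would raise the UnicodeEncodeError shown in the question; whether BeautifulSoup's tags behave this way here is an assumption.

```python
class Title(object):
    """Hypothetical stand-in for a parsed <title> tag (not BeautifulSoup's API)."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        # Used when a single element is printed or converted with str().
        return self.text

    def __repr__(self):
        # Used for every element when the enclosing list is printed.
        return "<title>%s</title>" % self.text


titles = [Title("REPORT A"), Title("REPORT B")]

as_list = str(titles)                  # the list calls repr() on each element
one_by_one = [str(t) for t in titles]  # each element goes through __str__

print(as_list)
print("\n".join(one_by_one))
```

The two conversions take different code paths, so one can fail on a non-ASCII character while the other succeeds.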

2010-07-01 14:16:01 unutbu

Dear Unutbu, thanks for the tips, both of them work. Strange... – 2010-07-02 08:39:49

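BeautifulSoup works in Unicode, so it is not responsible for that encoding error. More likely, your problem comes from the print statement: your standard output seems to be in ascii (i.e. sys.stdout.encoding = 'ascii' or absent), so trying to print a string containing non-ascii characters will indeed produce this kind of error.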
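The failure can be reproduced without BeautifulSoup or a terminal. A minimal sketch, assuming the fourth title contains the en dash u'\u2013' named in the traceback: encoding it with the ascii codec raises the same exception type, while utf-8 succeeds. The variable names are illustrative only.

```python
# Shortened version of the fourth title; \u2013 is the en dash from the traceback.
title = u"REPORT on equality between women and men \u2013 2009"

try:
    title.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    # The ascii codec has no byte for the en dash.
    ascii_error = exc

# UTF-8 can represent the en dash (as three bytes).
utf8_bytes = title.encode("utf-8")

print(ascii_error)
print(repr(utf8_bytes))
```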

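What is your operating system? What about your console, AKA terminal (e.g. if you are on Windows, which "codepage")? Have you set the environment variable PYTHONIOENCODING to control sys.stdout.encoding, or are you just hoping the encoding will be picked up automatically?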
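These values can be read directly on the machine in question. A small sketch; stdout_encoding_info is a hypothetical helper name, and what it reports depends on the terminal and the environment.

```python
import os
import sys


def stdout_encoding_info():
    """Collect the settings that influence how print encodes text for stdout."""
    return {
        # May be None when stdout is not a terminal (for example a pipe under Python 2).
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # None unless the variable was exported explicitly.
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        "default_encoding": sys.getdefaultencoding(),
    }


for name, value in sorted(stdout_encoding_info().items()):
    print("%s = %r" % (name, value))
```

Running the script as PYTHONIOENCODING=utf-8 python ebs.py sets the encoding for that one run without changing any terminal preferences.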

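On my Mac, where the encoding is detected correctly, running your code (except that I also print the number along with each title, for clarity) works fine and shows: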

$ python ebs.py 
1 [<title>REPORT Report on the proposal for a Council regulation temporarily suspending autonomous Common Customs Tariff duties on imports of certain industrial products into the autonomous regions of Madeira and the Azores - A7-0001/2010</title>] 
2 [<title>REPORT Report on the proposal for a Council directive concerning mutual assistance for the recovery of claims relating to taxes, duties and other measures - A7-0002/2010</title>] 
3 [<title>REPORT Report on the proposal for a regulation of the European Parliament and of the Council amending Council Regulation (EC) No 1085/2006 of 17 July 2006 establishing an Instrument for Pre-Accession Assistance (IPA) - A7-0003/2010</title>] 
4 [<title>REPORT on equality between women and men in the European Union – 2009 - A7-0004/2010</title>] 
5 [<title>REPORT Report on the proposal for a Council decision on the conclusion by the European Community of the Convention on the International Recovery of Child Support and Other Forms of Family Maintenance - A7-0005/2010</title>] 
6 [<title>REPORT on the proposal for a Council directive on administrative cooperation in the field of taxation - A7-0006/2010</title>] 
7 [<title>REPORT Report on promoting good governance in tax matters - A7-0007/2010</title>] 
8 [<title>REPORT Report on the proposal for a Council Directive amending Directive 2006/112/EC as regards an optional and temporary application of the reverse charge mechanism in relation to supplies of certain goods and services susceptible to fraud - A7-0008/2010</title>] 
9 [<title>REPORT Recommendation on the proposal for a Council decision concerning the conclusion, on behalf of the European Community, of the Additional Protocol to the Cooperation Agreement for the Protection of the Coasts and Waters of the North-East Atlantic against Pollution - A7-0009/2010</title>] 
$

2010-07-01 14:16:37

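Hi Alex, I am indeed using a Mac; how did you set yours up? Right now I am just hoping the encoding will be picked up automatically (I am still learning this whole confusing encoding business :)) – 2010-07-01 14:51:19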

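@Thomas, I did not set anything up; it works out of the box (utf8 is the default for Terminal.App, I believe; if not, then it is the only thing I set in Terminal's preferences). What is 'sys.stdout.encoding' in your Python (indeed, which Python and which MacOSX do you have? I have OSX 10.5, and it works with the Apple-distributed Python 2.5 and the python.org distributions of 2.4, 2.6 and 3.1, all out of the box and with no environment variables set). – 2010-07-01 15:11:46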

Hi Alex, I am using MacOSX 10.5.8 and Python 2.6. – 2010-07-01 15:23:45

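If you want to print the titles to a file, you need to specify an encoding that can represent non-ASCII characters; utf8 should work fine. To do that, add:

import codecs
out = codecs.open('titles.txt', 'w', 'utf8')

at the top of the script, and print to the file:

print >> out, title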
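A variation on this suggestion, sketched under assumptions: because the comment below reports the same error, it may help to write each title's text rather than the list object, so that no repr() of the list is involved. This sketch swaps in io.open (available from Python 2.6) for codecs.open; the sample strings and the temporary path stand in for the scraped titles and titles.txt.

```python
import io
import os
import tempfile

# Sample data standing in for [t.string for t in title] from the scraping loop.
titles = [
    u"REPORT on equality between women and men in the European Union \u2013 2009",
    u"REPORT on promoting good governance in tax matters",
]

# Temporary location used instead of 'titles.txt' so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "titles.txt")

# The file object encodes unicode text as UTF-8 on write.
with io.open(path, "w", encoding="utf-8") as out:
    for text in titles:
        out.write(text + u"\n")

# Read the file back to confirm the en dash survived the round trip.
with io.open(path, "r", encoding="utf-8") as saved:
    lines = saved.read().splitlines()

print(len(lines))
```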

2010-07-01 16:22:19 vpekar

Hi 馬爾蒂尤夫, thanks for your help, but it still gives me the same error. – 2010-07-01 23:02:54
