2010-07-01 96 views
0

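Problem scraping data with BeautifulSoup

I have written the following trial code to retrieve the titles of legislative acts from the European Parliament.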

import urllib2 
from BeautifulSoup import BeautifulSoup 

search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-%.4d&language=EN" 

for number in xrange(1,10): 
    url = search_url % number 
    page = urllib2.urlopen(url).read() 
    soup = BeautifulSoup(page) 
    title = soup.findAll("title") 
    print title 

However, whenever I run it, I get the following error:

Traceback (most recent call last): 
    File "<stdin>", line 20, in <module> 
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 70: ordinal not in range(128) 

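I have narrowed it down to BeautifulSoup not being able to read the fourth document in the loop. Can anyone explain to me what I am doing wrong?

With kind regards,

Thomas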

Answers

1

Replacing

print title 

with

for t in title: 
    print(t) 

or

print('\n'.join(t.string for t in title)) 

works. I'm not entirely sure why print <somelist> sometimes works and sometimes doesn't.


Dear Unutbu, thank you for the hints, both of them work for me. Strange... – 2010-07-02 08:39:49

4

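BeautifulSoup works in Unicode, so it is not responsible for this encoding error. More likely, your problem arises with the print statement: your standard output appears to be in ascii (i.e. sys.stdout.encoding is 'ascii' or absent), so trying to print a string that contains non-ascii characters will indeed raise this kind of error.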
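The failure described above can be reproduced without BeautifulSoup or a terminal. The following is a minimal sketch, not the original poster's code: the helper name try_encode is made up for illustration, the sample string is the fourth title from this thread, and the snippet is written so that it should also run on Python 3 (the thread itself uses Python 2).

```python
# The fourth title contains an en dash (U+2013), which ASCII cannot represent.
title = u"REPORT on equality between women and men in the European Union \u2013 2009 - A7-0004/2010"


def try_encode(text, encoding):
    """Return the encoded bytes, or the exception if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


# Roughly what happens when the output stream only accepts ASCII.
ascii_result = try_encode(title, "ascii")

# An encoding that covers U+2013 handles the same string without error.
utf8_result = try_encode(title, "utf-8")

print(type(ascii_result).__name__)
print(type(utf8_result).__name__)
```

Encoding to ASCII fails on the en dash, while UTF-8 succeeds, which is consistent with the error appearing only for the fourth document.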

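What is your operating system? How is your console, AKA terminal, set up (e.g., if you are on Windows, what "codepage")? Did you set the environment variable PYTHONIOENCODING to control sys.stdout.encoding, or are you just hoping the encoding will be picked up automatically?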
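One way to answer these questions is to ask the interpreter directly. This is a small sketch under the assumption that a Python 3 compatible snippet is acceptable; the function name describe_output_encoding is hypothetical, while sys.stdout.encoding and PYTHONIOENCODING are the names mentioned above.

```python
import os
import sys


def describe_output_encoding():
    """Collect the settings that decide how printed text gets encoded."""
    return {
        # May be None when output is redirected to a pipe or a file.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # None when the environment variable is not set.
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


report = describe_output_encoding()
for key in sorted(report):
    print("%s = %r" % (key, report[key]))
```

If the reported stdout encoding is missing or 'ascii', setting PYTHONIOENCODING (for example to utf-8) before starting Python is one option to try.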

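On my Mac, where the encoding is detected correctly, running your code (changed only to print the number along with each title, for clarity) works fine and shows: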

$ python ebs.py 
1 [<title>REPORT Report on the proposal for a Council regulation temporarily suspending autonomous Common Customs Tariff duties on imports of certain industrial products into the autonomous regions of Madeira and the Azores - A7-0001/2010</title>] 
2 [<title>REPORT Report on the proposal for a Council directive concerning mutual assistance for the recovery of claims relating to taxes, duties and other measures - A7-0002/2010</title>] 
3 [<title>REPORT Report on the proposal for a regulation of the European Parliament and of the Council amending Council Regulation (EC) No 1085/2006 of 17 July 2006 establishing an Instrument for Pre-Accession Assistance (IPA) - A7-0003/2010</title>] 
4 [<title>REPORT on equality between women and men in the European Union – 2009 - A7-0004/2010</title>] 
5 [<title>REPORT Report on the proposal for a Council decision on the conclusion by the European Community of the Convention on the International Recovery of Child Support and Other Forms of Family Maintenance - A7-0005/2010</title>] 
6 [<title>REPORT on the proposal for a Council directive on administrative cooperation in the field of taxation - A7-0006/2010</title>] 
7 [<title>REPORT Report on promoting good governance in tax matters - A7-0007/2010</title>] 
8 [<title>REPORT Report on the proposal for a Council Directive amending Directive 2006/112/EC as regards an optional and temporary application of the reverse charge mechanism in relation to supplies of certain goods and services susceptible to fraud - A7-0008/2010</title>] 
9 [<title>REPORT Recommendation on the proposal for a Council decision concerning the conclusion, on behalf of the European Community, of the Additional Protocol to the Cooperation Agreement for the Protection of the Coasts and Waters of the North-East Atlantic against Pollution - A7-0009/2010</title>] 
$ 

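Hi Alex, I am indeed using a Mac. How did you set yours up? Right now I am just hoping the encoding will be picked up automatically (I am still learning this whole confusing encoding business :)) – 2010-07-01 14:51:19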


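@Thomas, I didn't set anything up; it works out of the box (utf8 is the default for Terminal.App, I believe; if not, then that is the only thing I set in Terminal's preferences). What is 'sys.stdout.encoding' in your Python? And indeed, which versions of Python and MacOSX do you have? I have OSX 10.5, and it works with the Apple-distributed Python 2.5 and the python.org distributions of 2.4, 2.6 and 3.1, all out of the box and with no environment variables set. – 2010-07-01 15:11:46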


Hi Alex, I am using MacOSX 10.5.8 and Python 2.6. – 2010-07-01 15:23:45

0

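If you want to print the titles to a file, you need to specify an encoding that can represent non-ASCII characters; utf8 should work fine. To do that, add:

import codecs 
out = codecs.open('titles.txt', 'w', 'utf8') 

at the top of the script, and print to the file:

print >> out, title 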
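A possible variant of this idea writes the title text itself, rather than the whole result list, to a UTF-8 file. This is only a sketch under stated assumptions: it uses io.open instead of codecs.open, the two sample strings are copied from the output shown earlier in this thread instead of being fetched, and the file is placed in a temporary directory (a choice made here, not something from the answer). It is written to run on Python 3.

```python
import io
import os
import tempfile

# Sample titles copied from the thread; the second contains an en dash (U+2013).
titles = [
    u"REPORT Report on promoting good governance in tax matters - A7-0007/2010",
    u"REPORT on equality between women and men in the European Union \u2013 2009 - A7-0004/2010",
]

# Assumption: a temporary directory, so the sketch does not touch the working directory.
out_path = os.path.join(tempfile.mkdtemp(), "titles.txt")

# Opening the file with an explicit encoding makes the write independent
# of whatever encoding the terminal reports.
with io.open(out_path, "w", encoding="utf-8") as out:
    for text in titles:
        out.write(text + u"\n")

# Read the file back to confirm that the non-ASCII character survived.
with io.open(out_path, "r", encoding="utf-8") as check:
    lines = check.read().splitlines()

print(len(lines))
```

Whether this avoids the error in the original script depends on how the titles are converted to text before writing, so treat it as something to try and not as a confirmed fix.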

Hi 馬爾蒂尤夫, thanks for your help, but it still gives me the same error. – 2010-07-01 23:02:54