Scraping with BeautifulSoup
I'm having some trouble scraping certain fields from a web page. In the code below, the first two for loops work, but I'm stuck on the last one.
from bs4 import BeautifulSoup
import urllib2

url = "https://www.mturk.com/mturk/findhits?match=false"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

requesters = soup.findAll('span', {'class': 'requesterIdentity'})
for eachrequester in requesters:
    print "Requester Name: " + eachrequester.string

rewards = soup.findAll('span', {'class': 'reward'})
for eachreward in rewards:
    print "Reward: " + eachreward.string

hitnames = soup.findAll('a', {'class': 'capsulelink'})  # THE ISSUE IS IN THESE 3 LINES
for eachhitname in hitnames:
    print "Hit Name: " + eachhitname.string
The code currently outputs:
Requester Name: Andrew Ryan
Requester Name: Vishwanath Kumar
Requester Name: rohzit0d
Requester Name: Jon Brelig
Requester Name: Tagasauris
Requester Name: Tagasauris
Requester Name: Tagasauris
Requester Name: CopyText Inc.
Requester Name: Tagasauris
Requester Name: Amazon Requester Inc.
Reward: $0.24
Reward: $0.03
Reward: $0.00
Reward: $0.05
Reward: $0.04
Reward: $0.02
Reward: $0.02
Reward: $0.01
Reward: $0.04
Reward: $0.00
Traceback (most recent call last):
  File "C:/Users/admin/Desktop/pythonimageret/hitgwt.py", line 19, in <module>
    print "Hit Name: "+eachhitname.string
TypeError: cannot concatenate 'str' and 'NoneType' objects
I realize the script can't find the text content in this HTML. The HTML looks like this:
<a class="capsulelink" href="#" id="capsule6-0">
Indoors or Out?
<span class="tags"></span>
</a>
I think it's because href="#" id="capsule6-0" come between class="" and >.
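The extra attributes are probably not the cause: matching on class works no matter which other attributes the tag carries. A more likely explanation, assuming the live page matches the snippet above, is that the <a> tag has more than one child (a text node plus a <span>), and in that case .string returns None. Below is a minimal sketch in Python 3 that parses only the sample markup from the question rather than fetching the real page. The variable names and the choice of the "html.parser" backend are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the real page is not fetched here.
html = """
<a class="capsulelink" href="#" id="capsule6-0">
Indoors or Out?
<span class="tags"></span>
</a>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", {"class": "capsulelink"})

# .string is None because the <a> tag has several children
# (text nodes and a <span>), so there is no single string to return.
print(link.string)

# get_text() joins all descendant text; strip=True trims surrounding whitespace.
hit_name = link.get_text(strip=True)
print("Hit Name: " + hit_name)
```

In the original loop, the equivalent change would be to use eachhitname.get_text(strip=True) in place of eachhitname.string.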
Thanks so much, this works perfectly and I understand why. (I can't upvote yet because I don't have enough reputation.) – Rorschach 2014-10-31 16:21:33