2014-10-30 179 views

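I'm having some trouble scraping certain fields from a web page. The first two for loops in the code below run fine, but I'm stuck on the last one. I'm scraping with BeautifulSoup.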

from bs4 import BeautifulSoup 
import urllib2 
url="https://www.mturk.com/mturk/findhits?match=false" 
page=urllib2.urlopen(url) 
soup = BeautifulSoup(page.read()) 


requesters=soup.findAll('span',{'class':'requesterIdentity'}) 
for eachrequester in requesters: 
    print "Requester Name: "+eachrequester.string 

rewards=soup.findAll('span',{'class':'reward'}) 
for eachreward in rewards: 
    print "Reward: "+eachreward.string 


hitnames=soup.findAll('a',{'class':'capsulelink'}) #THE ISSUE IS IN THESE 3 LINES 
for eachhitname in hitnames: 
    print "Hit Name: "+eachhitname.string 

The code currently outputs:

Requester Name: Andrew Ryan 
Requester Name: Vishwanath Kumar 
Requester Name: rohzit0d 
Requester Name: Jon Brelig 
Requester Name: Tagasauris 
Requester Name: Tagasauris 
Requester Name: Tagasauris 
Requester Name: CopyText Inc. 
Requester Name: Tagasauris 
Requester Name: Amazon Requester Inc. 
Reward: $0.24 
Reward: $0.03 
Reward: $0.00 
Reward: $0.05 
Reward: $0.04 
Reward: $0.02 
Reward: $0.02 
Reward: $0.01 
Reward: $0.04 
Reward: $0.00 

Traceback (most recent call last): 
    File "C:/Users/admin/Desktop/pythonimageret/hitgwt.py", line 19, in <module> 
    print "Hit Name: "+eachhitname.string 
TypeError: cannot concatenate 'str' and 'NoneType' objects 

I realize the script cannot find the text inside the HTML here. The HTML looks like this:

<a class="capsulelink" href="#" id="capsule6-0"> 
    Indoors or Out? 
    <span class="tags"></span> 
</a> 

I think it is because of the href="#" id="capsule6-0" attributes that sit next to class="capsulelink".

Answer

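eachhitname.string is None because that attribute only holds text when the element contains nothing but that text, with no other tags alongside it. Each link here also contains a <span class="tags"></span> element, so the <a> tag has more than one child and .string returns None. The href and id attributes have nothing to do with it.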
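
To make the difference concrete, here is a minimal sketch that parses the snippet from the question from an inline string instead of the live page. It assumes Python 3 with bs4 installed and uses the built-in "html.parser" backend; the extra reward span is added only for contrast.

```python
from bs4 import BeautifulSoup

# Inline copy of the markup from the question, plus one reward span
# (added here for contrast) whose only child is a text node.
html = '''
<a class="capsulelink" href="#" id="capsule6-0">
    Indoors or Out?
    <span class="tags"></span>
</a>
<span class="reward">$0.24</span>
'''

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", {"class": "capsulelink"})
reward = soup.find("span", {"class": "reward"})

# The reward span has a single text child, so .string returns that text.
reward_string = reward.string

# The link has text plus a nested <span>, so .string is None.
link_string = link.string

# .text concatenates every text node under the tag; strip() trims the
# surrounding newlines and indentation.
link_text = link.text.strip()

print(repr(reward_string))
print(repr(link_string))
print(repr(link_text))
```

This is also why the first two loops in the question work: the requester and reward spans contain only text.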

Use the .text attribute instead, and add a str.strip() call to remove the surrounding whitespace:

hitnames = soup.findAll('a', {'class':'capsulelink'}) 
for eachhitname in hitnames: 
    print "Hit Name: " + eachhitname.text.strip() 
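
For readers on Python 3, a rough equivalent of the same fix might look like the sketch below. The helper name hit_names and the second link in the sample markup are made up for illustration, and the HTML is inline because the live page may no longer serve this markup.

```python
from bs4 import BeautifulSoup


def hit_names(html):
    """Return the visible text of every <a class="capsulelink"> in html.

    hit_names is a hypothetical helper, not part of the original script.
    """
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="capsulelink")
    # get_text(strip=True) strips each text fragment before joining them.
    return [link.get_text(strip=True) for link in links]


# First link is from the question; the second is invented for the example.
sample = '''
<a class="capsulelink" href="#" id="capsule6-0">
    Indoors or Out?
    <span class="tags"></span>
</a>
<a class="capsulelink" href="#" id="capsule7-0">
    Hypothetical second HIT
    <span class="tags"></span>
</a>
'''

names = hit_names(sample)
for name in names:
    print("Hit Name: " + name)
```

If the nested tags ever carry text of their own, get_text(strip=True) joins the fragments with no separator, so passing a separator such as get_text(" ", strip=True) may be the safer choice.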

Thanks so much, this works perfectly and I understand why. (I can't upvote yet, I don't have enough reputation.) – Rorschach 2014-10-31 16:21:33