Scrapy newbie here. I'm using Scrapy to pull a large amount of data from a single site. When I run the script it works fine for a few minutes, but then it slows down, eventually stops, and keeps throwing the two errors below for different URLs it is trying to scrape:
2013-07-20 14:15:17-0700 [billboard_spider] DEBUG: Retrying <GET http://www.billboard.com/charts/1981-01-17/hot-100> (failed 1 times): Getting http://www.billboard.com/charts/1981-01-17/hot-100 took longer than 180 seconds.
2013-07-20 14:16:56-0700 [billboard_spider] DEBUG: Crawled (502) <GET http://www.billboard.com/charts/1981-01-17/hot-100> (referer: None)
The errors above keep piling up for different URLs, and I can't figure out what is causing them...
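As far as I can tell, the "took longer than 180 seconds" message comes from Scrapy's default DOWNLOAD_TIMEOUT, and the retry line comes from the built-in RetryMiddleware. Below is a minimal sketch of settings.py overrides that would raise that timeout and the retry budget; the values are only illustrative, not a known fix:

# settings.py (sketch): illustrative values, not a known fix.
DOWNLOAD_TIMEOUT = 300                          # lift the default 180-second cutoff seen in the first error
RETRY_TIMES = 5                                 # retry failing requests more often than the default
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]    # make sure 502 responses are retried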
Here is the script:
import datetime
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BillBoardItem(Item):
    date = Field()
    song = Field()
    artist = Field()


BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


class BillBoardSpider(BaseSpider):
    name = "billboard_spider"
    allowed_domains = ["billboard.com"]

    def __init__(self):
        # Build one Hot 100 chart URL per week, from 1975-12-27 up to 2013.
        date = datetime.date(year=1975, month=12, day=27)
        self.start_urls = []
        while True:
            if date.year >= 2013:
                break
            self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
            date += datetime.timedelta(days=7)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]
        songs = hxs.select('//div[@class="listing chart_listing"]/article')
        item = BillBoardItem()
        item['date'] = date
        # Take the first chart entry that parses cleanly (the top song for that week).
        for song in songs:
            try:
                track = song.select('.//header/h1/text()').extract()[0]
                track = track.rstrip()
                item['song'] = track
                item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
                break
            except:
                continue
        yield item
Could it be that you are getting banned? http://doc.scrapy.org/en/0.16/topics/practices.html#bans – Tiago
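If it really is a ban, the practices page linked above mostly boils down to slowing the crawl down and identifying yourself. A minimal, hypothetical settings.py sketch along those lines (the values are guesses, and AUTOTHROTTLE_ENABLED relies on the AutoThrottle extension that ships with Scrapy):

# settings.py (sketch): throttle the crawl so it looks less like a burst of requests.
DOWNLOAD_DELAY = 2                    # wait roughly two seconds between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # one request at a time to billboard.com
AUTOTHROTTLE_ENABLED = True           # back off automatically when responses slow down
USER_AGENT = 'billboard_spider (+your contact address)'  # hypothetical UA string identifying the crawler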