Scrapy - image download does not work on a second crawl

I wrote a spider that scrapes data from a website and then follows links to fetch detail data. The spider also downloads images via the default Scrapy images pipeline. So far everything works.

But when I start the spider a second time (with another search term), the image download no longer works. The crawling itself works as expected, and I get no errors.
Here is the spider:
import logging
import urlparse

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

class DiscoSpider(BaseSpider):

    def __init__(self, query):
        super(DiscoSpider, self).__init__()
        self.name = "discogs"
        self.allowed_domains = ["discogs.com"]
        self.start_urls = [
            "http://www.discogs.com/search?q=%s&type=release" % query
        ]

    # parse all releases for the current search
    def parse(self, response):
        logging.debug('scrapy.parse')
        hxs = HtmlXPathSelector(response)
        li = hxs.select("//div[@id='page_content']/ol/li")
        for l in li:
            item = DiscogsItem()
            ...
            # get the link for the tracklist callback
            link = l.select("a/@href").extract()
            item['link'] = link[0] if link else ''
            # get the image location(s); extract() returns a (possibly empty) list
            img = l.select("a/img/@src").extract()
            item['image_urls'] = img if img else []
            # build the absolute url for the tracklist page
            url = urlparse.urljoin('http://www.%s' % self.allowed_domains[0], item['link'])
            # request the tracklist page, passing the item along in the request meta
            yield Request(url, meta={'item': item}, callback=self.parse_tracklist)

    # callback to get the tracklist for each release
    def parse_tracklist(self, response):
        item = response.request.meta['item']
        hxs = HtmlXPathSelector(response)
        rows = hxs.select("//div[@class='section_content']/table[@class='playlist mini_playlist']/tr")
        tracklist = []
        for row in rows:
            track = {}
            title = row.select("td[@class='track']/span[@class='track_title']/text()").extract()
            track['title'] = self.clean_track(title[0]) if title else ''
            ...
            tracklist.append(track)
        item['tracklist'] = tracklist
        yield item
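As a side note, the detail-page URL in parse is built by joining the base host with the href extracted from the listing. A minimal self-contained sketch of that step (using Python 3's urllib.parse here; the spider itself uses the Python 2 urlparse module, which behaves identically for this case — the link value is just a hypothetical example):

```python
from urllib.parse import urljoin

base = "http://www.discogs.com"
link = "/release/12345"  # hypothetical href extracted from the search results
url = urljoin(base, link)
print(url)  # http://www.discogs.com/release/12345
```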
Here is the item:
from scrapy.item import Item, Field

class DiscogsItem(Item):
    # define the fields for your item here
    link = Field()
    artist = Field()
    release = Field()
    label = Field()
    year = Field()
    tracklist = Field()
    image_urls = Field()
    images = Field()
    thumb = Field()
And in my Scrapy settings:
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGES_STORE = '/home/f/work/py/discogs/tmp'
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_IP = 20
IMAGES_EXPIRES = 0
I run the spider in a separate process from a PyQt UI, and I'm new to Scrapy/PyQt/StackOverflow (sorry for the formatting).
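The process launching looks roughly like the sketch below (a simplified stand-in, not my actual code: run_crawl is a hypothetical placeholder for whatever starts the Scrapy crawl for one query). The idea is that each run gets a fresh child process, so a fresh Twisted reactor and a fresh images pipeline, and no state survives between runs:

```python
from multiprocessing import Process

def run_crawl(query):
    # Hypothetical placeholder: in the real app this would configure and
    # start the Scrapy crawl for one search term. Running it in a child
    # process means nothing persists into the next crawl.
    print("crawling: %s" % query)

def crawl_in_fresh_process(query):
    p = Process(target=run_crawl, args=(query,))
    p.start()
    p.join()
    return p.exitcode  # 0 means the child finished cleanly

if __name__ == "__main__":
    crawl_in_fresh_process("some search term")
```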
I'm on an Xubuntu 12.04 box with Python 2.7, PyQt4 and Scrapy 0.12.0.2546.
Does anyone know why the image download doesn't work the second time?
Thanks in advance.
If the first of two seemingly independent runs works and the second doesn't, some state is probably affecting the second run. My initial thought is that the discogs server might somehow be throttling your crawl. Could you post your debug output from the second scrape run (and summarize the first)? – Michael 2013-02-24 16:16:34
Thanks for your answer! – user937284 2013-02-26 11:44:37