2012-11-27
0

Scrapy does not crawl even though the request is queued?

Currently, I have the following rules:

# Matches the first comments page under a user overview, e.g.
# http://lookbook.nu/user/50784-Adam-G/comments/
Rule(SgmlLinkExtractor(allow=(r'/user/\d+[^/]+/comments/?$'), deny=(r'\?locale=')),
    callback='parse_model_comments'),
# Matches the paginated comments pages, e.g.
# http://lookbook.nu/user/50784-Adam-G/comments?page=2
Rule(SgmlLinkExtractor(allow=(r'/user/\d+[^/]+/comments\?page=\d+$'), deny=(r'\?locale=')),
    callback='parse_model_comments'),

In my function definition:

def parse_model_comments(self, response): 
    log.msg("Inside parse_model_comments") 
    hxs = HtmlXPathSelector(response) 
    model_url = hxs.select('//div[@id="userheader"]/h1/a/@href').extract()[0] 
    comments_hxs = hxs.select(
     '//div[@id="profile_comments"]/div[@id="comments"]/div[@class="comment"]') 
    if comments_hxs: 
        log.msg("Yielding next page." + LookbookSpider.next_page(response.url))
        yield Request(LookbookSpider.next_page(response.url))

Here is the actual run log:

2012-11-26 18:52:46-0800 [lookbook] DEBUG: Crawled (200) <GET http://lookbook.nu/user/1363501-Rachael-Jane-H/comments> (referer: None) 
2012-11-26 18:52:46-0800 [lookbook] DEBUG: Crawled (200) <GET http://lookbook.nu/user/1363501-Rachael-Jane-H/comments> (referer: http://lookbook.nu/user/1363501-Rachael-Jane-H/comments) 
2012-11-26 18:52:46-0800 [scrapy] INFO: Inside parse_model_comments 
2012-11-26 18:52:46-0800 [scrapy] INFO: Yielding next page.http://lookbook.nu/user/1363501-Rachael-Jane-H/comments?page=2 
2012-11-26 18:52:46-0800 [lookbook] DEBUG: Scraped from <200 http://lookbook.nu/user/1363501-Rachael-Jane-H/comments> 
    {'model_url': u'http://lookbook.nu/rachinald', 
    'posted_at': u'2012-11-26T13:21:49-05:00', 
    'target_url': u'http://lookbook.nu/look/4290423-Blackout-Challenge-One', 
    'text': u"Thanks Justina :) They're actually purple - the whole premise is to not wear black all week ^^", 
    'type': 2} 
... 
2012-11-26 18:52:47-0800 [lookbook] DEBUG: Crawled (200) <GET http://lookbook.nu/user/1363501-Rachael-Jane-H/comments?page=2> (referer: http://lookbook.nu/user/1363501-Rachael-Jane-H/comments) 
2012-11-26 18:52:48-0800 [lookbook] INFO: Closing spider (finished) 
2012-11-26 18:52:48-0800 [lookbook] INFO: Dumping Scrapy stats: 
    {'downloader/request_bytes': 2072, 
    'downloader/request_count': 3, 
    'downloader/request_method_count/GET': 3, 
    'downloader/response_bytes': 51499, 
    'downloader/response_count': 3, 
    'downloader/response_status_count/200': 3, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2012, 11, 27, 2, 52, 48, 43058), 
    'item_scraped_count': 14, 
    'log_count/DEBUG': 23, 
    'log_count/INFO': 6, 
    'request_depth_max': 3, 
    'response_received_count': 3, 
    'scheduler/dequeued': 3, 
    'scheduler/dequeued/memory': 3, 
    'scheduler/enqueued': 3, 
    'scheduler/enqueued/memory': 3, 
    'start_time': datetime.datetime(2012, 11, 27, 2, 52, 44, 446851)} 
2012-11-26 18:52:48-0800 [lookbook] INFO: Spider closed (finished) 

Even though page=2 is crawled, parse_model_comments is not called for it, because "Inside parse_model_comments" is not logged a second time.

I checked re.search('/user/\d+[^/]+/comments\?page=\d+$', 'http://lookbook.nu/user/1363501-Rachael-Jane-H/comments?page=2') and confirmed that it does match.
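
A minimal sketch of that check, run against both `allow` patterns and the `deny` pattern from the rules above. The constant names are only for illustration and do not come from the spider:

```python
import re

# The two `allow` patterns and the `deny` pattern from the rules above.
FIRST_PAGE = r'/user/\d+[^/]+/comments/?$'
LATER_PAGES = r'/user/\d+[^/]+/comments\?page=\d+$'
DENY = r'\?locale='

url = 'http://lookbook.nu/user/1363501-Rachael-Jane-H/comments?page=2'

matches_first = re.search(FIRST_PAGE, url) is not None
matches_later = re.search(LATER_PAGES, url) is not None
denied = re.search(DENY, url) is not None

print(matches_first, matches_later, denied)  # False True False
```

So only the second pattern matches the page=2 URL and the deny pattern does not exclude it, which suggests the regular expressions themselves are not the problem.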

Any idea why page=2 is crawled but the function is not called?

Answers

0

It turns out that in a CrawlSpider you have to specify the callback manually when you yield a Request yourself.

If the Request object has no callback set, the default parse() is called.

CrawlSpider's parse() function just returns []. Check the source.
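
To make the effect concrete, here is a toy model of the callback fallback described above. `Request`, `ToySpider` and `dispatch` are simplified stand-ins written for this illustration, not Scrapy's real classes or engine code; they only assume that a request without a callback is handed to `parse()`:

```python
# Simplified stand-ins, NOT Scrapy's real classes: just enough to show how a
# missing callback sends the response to the default parse().
class Request:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


class ToySpider:
    def parse(self, response):
        # Models the behaviour described above: the default returns nothing.
        return []

    def parse_model_comments(self, response):
        return ["comments from " + response]


def dispatch(spider, request, response):
    """Use the request's callback, falling back to spider.parse()."""
    callback = request.callback or spider.parse
    return list(callback(response))


spider = ToySpider()
url = "http://lookbook.nu/user/1363501-Rachael-Jane-H/comments?page=2"

without_callback = dispatch(spider, Request(url), url)
with_callback = dispatch(
    spider, Request(url, callback=spider.parse_model_comments), url)

print(without_callback)  # []
print(with_callback)
```

In the spider from the question, the corresponding change would be to yield `Request(LookbookSpider.next_page(response.url), callback=self.parse_model_comments)` instead of a Request with no callback.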

-2

I know this looks strange, but the callback should not be a generator (something that yields).

I would suggest:

def parse_model_comments(self, response): 
    return list(self._iter_parse_model_comments(response))

def _iter_parse_model_comments(self, response): 
    # place your current code here 
+0

This is incorrect. – disappearedng