Crawling a website sequentially with Scrapy

Is there a way to tell Scrapy to stop crawling based on a condition found on a second-level page? Here is what I am doing:

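I start with a start_url (the first-level page).
I extract a set of URLs from the start_url in parse(self, response).
I then queue those links as Requests with parseDetailPage(self, response) as the callback.
In parseDetailPage (the second-level page) I find out whether I can stop crawling or not.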

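Right now I am using CloseSpider() to accomplish this, but the problem is that by the time I start crawling the second-level pages, the URLs to be parsed are already queued, and I do not know how to remove them from the queue. Is there a way to crawl the list of links sequentially and then be able to stop in parseDetailPage?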

from scrapy import Request
from scrapy.exceptions import CloseSpider

# Body of the spider class (the class statement is not part of the post).
# WPUtil and WoodPeckerItem are project-specific helpers; self.end_url is set elsewhere.
start_urls = ["http://sfbay.craigslist.org/sof/"]

def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.job_in_range = True

def parse(self, response):
    results = response.xpath('//blockquote[@id="toc_rows"]')
    requests = []
    if results:
        links = results.xpath('.//p[@class="row"]/a/@href')
        for link in links:
            nextUrl = link.extract()
            # Compare the extracted URL string, not the selector object.
            if nextUrl == self.end_url:
                break
            if WPUtil.validateUrl(nextUrl):
                item = WoodPeckerItem()
                item['url'] = nextUrl
                # Every detail request is queued here, before any detail page is parsed.
                requests.append(Request(nextUrl, meta={'item': item}, callback=self.parseDetailPage))
    else:
        self.logger.error('Could not parse the document')
    return requests

def parseDetailPage(self, response):
    if not self.job_in_range:
        raise CloseSpider('End date reached - No more crawling for ' + self.name)
    print(response)
    body = response.xpath('//article[@id="pagecontainer"]/section[@class="body"]')
    item = response.meta['item']
    item['postDate'] = body.xpath('.//section[@class="userbody"]/div[@class="postinginfos"]/p')[1].xpath('.//date/text()')[0].extract()
    # Extract jobTitle before testing it, and compare strings with ==, not "is".
    item['jobTitle'] = body.xpath('.//h2[@class="postingtitle"]/text()')[0].extract()
    if item['jobTitle'] == 'Admin':
        self.job_in_range = False
        raise CloseSpider('Stop crawling')
    item['description'] = body.xpath('.//section[@class="userbody"]/section[@id="postingbody"]').extract()
    return item

Asked by Praveer, 2013-02-19

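Do you mean that you want to stop the spider and later resume it without re-parsing URLs that have already been parsed? If so, you can try the JOBDIR setting. It persists the request queue to a directory on disk that you specify.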
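As a rough illustration of that suggestion, the setting can be placed in the project's settings module. The directory name below is made up for the example; Scrapy's documentation also shows passing the same setting on the command line with -s JOBDIR=...

```python
# settings.py of the Scrapy project
# "crawls/sfbay-sof-1" is a hypothetical directory name chosen for this example.
# Scrapy keeps the pending request queue and other crawl state in this directory,
# so a crawl that was stopped gracefully can be resumed with the same JOBDIR.
JOBDIR = "crawls/sfbay-sof-1"
```

Use a separate directory per spider and per run; the stored state is only meant for resuming that one crawl.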

Answered 2013-02-22 06:55:26

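I want to stop the crawl completely when a condition is met on the parseDetailPage page, not resume it. The problem I am facing is that the queue already holds a large number of URLs, and Scrapy crawls them regardless of CloseSpider being raised. – Praveer 2013-02-25 20:18:15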

Which CloseSpider did you use: scrapy.contrib.closespider.CloseSpider or scrapy.exceptions.CloseSpider? – 2013-02-26 08:04:34
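One way to sidestep the already-queued URLs described in the question is to never queue them all at once: schedule only the first detail page, carry the remaining URLs along in meta, and let each detail callback decide whether to schedule the next one. The sketch below shows that chaining idea in plain Python so it runs without Scrapy. The Request class, run_crawl loop, and PAGES table are stand-ins invented for the illustration, not Scrapy APIs, and this is not necessarily how the thread was resolved.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Request:
    """Stand-in for a crawler request: a URL, a callback and a meta dict."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)

# Hypothetical site: one listing page and four detail pages (URL -> job title).
PAGES = {
    "list": ["job/1", "job/2", "job/3", "job/4"],
    "job/1": "Developer",
    "job/2": "Admin",  # the stop condition is met on this page
    "job/3": "Tester",
    "job/4": "Designer",
}

def next_request(pending):
    """Schedule only the first pending URL; the rest travel along in meta."""
    if pending:
        yield Request(pending[0], parse_detail, {"pending": pending[1:]})

def parse(url, meta):
    """First-level callback: extract the links, but queue just one of them."""
    yield from next_request(PAGES[url])

def parse_detail(url, meta):
    """Second-level callback: emit an item, then chain to the next URL."""
    title = PAGES[url]
    if title == "Admin":
        return  # nothing else was queued, so the crawl simply ends here
    yield {"url": url, "jobTitle": title}
    yield from next_request(meta["pending"])

def run_crawl(start_url):
    """Toy scheduler: FIFO queue of requests, callbacks yield requests or items."""
    queue = deque([Request(start_url, parse)])
    fetched, items = [], []
    while queue:
        req = queue.popleft()
        fetched.append(req.url)
        for out in req.callback(req.url, req.meta):
            (queue if isinstance(out, Request) else items).append(out)
    return fetched, items

fetched, items = run_crawl("list")
print(fetched)  # pages actually visited
print(items)    # items collected before the stop condition
```

In a real spider the same shape would use scrapy.Request with meta={'pending': ...} in parse and read response.meta['pending'] in parseDetailPage. The trade-off is that detail pages are fetched strictly one after another, so the crawl loses the benefit of concurrent requests.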
