
When using BaseSpider, how can I implement link-extraction rules? This is my Scrapy code:

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 

from dmoz.items import DmozItem 

class DmozSpider(BaseSpider):
    domain_name = "dmoz.org"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul[2]/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

SPIDER = DmozSpider()

If I use CrawlSpider, I can use rules to implement link extraction, but how can I apply rules with BaseSpider, as in the example above? Rules only work with CrawlSpider, not BaseSpider.

Answer


Perhaps you could check each response against your rule criteria in parse, then pass the successful responses to a second callback? Pseudocode below:

from scrapy.http import Request

def parse(self, response):
    # check response for rule criteria
    ...
    if rule:
        # create new request to pass to second callback
        req = Request("http://www.example.com/follow", callback=self.parse2)
        return req

def parse2(self, response):
    hxs = HtmlXPathSelector(response)
    # do stuff with the successful response
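The `if rule:` check in the pseudocode above can be as simple as a URL-pattern test, which is roughly what CrawlSpider's rules do under the hood. A minimal standard-library sketch (the pattern and function name here are illustrative, not part of Scrapy):

```python
import re

# Hypothetical rule: only follow links whose URL matches this pattern,
# mimicking what a CrawlSpider Rule with a link extractor would allow.
FOLLOW_RULE = re.compile(r"/Languages/Python/")

def rule_matches(url):
    # Return True when the URL satisfies the follow rule.
    return bool(FOLLOW_RULE.search(url))

print(rule_matches("http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"))
print(rule_matches("http://www.dmoz.org/Computers/Hardware/"))
```

Inside `parse`, you would run such a check on each extracted link and only build a `Request` for the URLs that pass.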

Could I call the parse function recursively, or would a second parse function be better? – user1858027


The 'parse' function will be called for all of the start URLs. You'll need to process each response and find the links matching your rules before passing each new request to 'parse2'. – Talvalin
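The comment above describes extracting every link from a response body and keeping only those that match a rule. A self-contained sketch of that step using only the standard library (the class name, rule pattern, and helper are illustrative assumptions, not Scrapy APIs):

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects href attribute values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def matching_links(base_url, body, rule=re.compile(r"/Books/")):
    # Extract all links from the HTML body, resolve them against the
    # page URL, and keep only those matching the rule pattern.
    parser = LinkCollector()
    parser.feed(body)
    return [urljoin(base_url, h) for h in parser.hrefs if rule.search(h)]

page = ('<ul><li><a href="/Computers/Books/">Books</a></li>'
        '<li><a href="/About/">About</a></li></ul>')
print(matching_links("http://www.dmoz.org", page))
```

In the Scrapy version, each URL returned here would become a `Request(url, callback=self.parse2)` yielded from `parse`.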