
Python Scrapy - yield statement not working as expected

I have a Scrapy spider that looks like this. Basically it takes a list of URLs, follows the internal links, and grabs the external links. What I am trying to do is make it synchronous, so that the URLs in url_list are parsed in order.

from scrapy import Spider, Request
from scrapy.linkextractors import LinkExtractor


class SomeSpider(Spider):
    name = 'grablinksync'
    url_list = ['http://www.sports.yahoo.com/', 'http://www.yellowpages.com/']
    allowed_domains = ['www.sports.yahoo.com', 'www.yellowpages.com']
    links_to_crawl = []
    parsed_links = 0

    def start_requests(self):
        # Initial request starts here
        start_url = self.url_list.pop(0)
        return [Request(start_url, callback=self.get_links_to_parse)]

    def get_links_to_parse(self, response):
        for link in LinkExtractor(allow=self.allowed_domains).extract_links(response):
            self.links_to_crawl.append(link.url)
            yield Request(link.url, callback=self.parse_obj, dont_filter=True)

    def start_next_request(self):
        self.parsed_links = 0
        self.links_to_crawl = []
        # All links have been parsed, now generate request for next URL
        if len(self.url_list) > 0:
            yield Request(self.url_list.pop(0), callback=self.get_links_to_parse)

    def parse_obj(self, response):
        self.parsed_links += 1
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item = CrawlsItem()
            item['DomainName'] = get_domain(response.url)
            item['LinkToOtherDomain'] = link.url
            item['LinkFoundOn'] = response.url
            yield item
        if self.parsed_links == len(self.links_to_crawl):
            # This doesn't work
            self.start_next_request()

My problem is that start_next_request() is never called. If I move the code from start_next_request() into the parse_obj() function, it works as expected.

def parse_obj(self, response):
    self.parsed_links += 1
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        item = CrawlsItem()
        item['DomainName'] = get_domain(response.url)
        item['LinkToOtherDomain'] = link.url
        item['LinkFoundOn'] = response.url
        yield item
    if self.parsed_links == len(self.links_to_crawl):
        # This works..
        self.parsed_links = 0
        self.links_to_crawl = []
        # All links have been parsed, now generate request for next URL
        if len(self.url_list) > 0:
            yield Request(self.url_list.pop(0), callback=self.get_links_to_parse)

I want to keep this logic in a separate start_next_request() function because I plan to call it from a few other places as well. I understand it has something to do with start_next_request() being a generator function, but I am new to generators and yield, so I am having a hard time figuring out what I am doing wrong.


Please read the posting guidelines carefully; you should extract a minimal example. –

Answer


That is because yield turns the function into a generator, and simply writing self.start_next_request() does not make the generator do anything.

Generators are lazy: until you ask one for its first object, it does nothing at all.
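As a plain-Python illustration (independent of Scrapy), calling a generator function only creates a generator object; its body does not run until you iterate over it:

def make_requests():
    print('generator body started')  # runs only once iteration begins
    yield 'request 1'
    yield 'request 2'

gen = make_requests()   # nothing is printed yet; the body has not executed
print(list(gen))        # prints 'generator body started', then the two values

This is exactly why self.start_next_request() on its own has no effect: the generator is created and then discarded without ever being iterated.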

You can change the code to:

def parse_obj(self, response):
    self.parsed_links += 1
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        item = CrawlsItem()
        item['DomainName'] = get_domain(response.url)
        item['LinkToOtherDomain'] = link.url
        item['LinkFoundOn'] = response.url
        yield item
    if self.parsed_links == len(self.links_to_crawl):
        for res in self.start_next_request():
            yield res

Note that return self.start_next_request() would only work in a callback that is not itself a generator, since Scrapy accepts a callback that returns an iterable of requests. Because parse_obj already contains yield, you have to re-yield each request as shown above (or delegate with yield from on Python 3.3+).
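For reference, a minimal sketch of the same callback using yield from (Python 3.3+ only), which delegates to the inner generator instead of re-yielding in a loop; it assumes the rest of the spider is unchanged:

def parse_obj(self, response):
    self.parsed_links += 1
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        item = CrawlsItem()
        item['DomainName'] = get_domain(response.url)
        item['LinkToOtherDomain'] = link.url
        item['LinkFoundOn'] = response.url
        yield item
    if self.parsed_links == len(self.links_to_crawl):
        # Hands every request produced by start_next_request() back to Scrapy
        yield from self.start_next_request()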