
In Scrapy, 'start_urls' is reported as not defined when it is passed as an input argument. The spider below works with a fixed start_urls:

import scrapy 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 

class NumberOfPagesSpider(CrawlSpider): 
    name = "number_of_pages" 
    allowed_domains = ["funda.nl"] 

    # def __init__(self, place='amsterdam'): 
    #     self.start_urls = ["http://www.funda.nl/koop/%s/" % place] 

    start_urls = ["http://www.funda.nl/koop/amsterdam/"] 

    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0]) 

    rules = (Rule(le_maxpage, callback='get_max_page_number'),) 

    def get_max_page_number(self, response): 
        links = self.le_maxpage.extract_links(response) 
        max_page_number = 0                                          # Initialize the maximum page number 
        for link in links: 
            if link.url.count('/') == 6 and link.url.endswith('/'):  # Select only pages at a path depth of 3 
                page_number = int(link.url.split("/")[-2].strip('p'))  # E.g. get 10 out of 'http://www.funda.nl/koop/amsterdam/p10/' 
                if page_number > max_page_number: 
                    max_page_number = page_number                    # Keep the largest page number seen so far 
        filename = "max_pages.txt"                                   # Name of the output file 
        with open(filename, 'w') as f: 
            f.write('max_page_number = %s' % max_page_number)        # Write the maximum page number to a text file 

If I run this with scrapy crawl number_of_pages, it writes the expected .txt file. However, if I modify it by commenting the __init__ lines back in and commenting out the start_urls = line, and then try to run it with a user-defined input argument,

scrapy crawl number_of_pages -a place=amsterdam 

I get the following error:

le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0]) 
NameError: name 'start_urls' is not defined 

So according to the spider, start_urls is not defined, even though the code fully determines it in the initializer. How can I get this spider to work with a start_urls defined by an input argument?

Answers


Your le_maxpage is a class-level variable. When you pass the argument to __init__, you are creating an instance-level variable start_urls.

You used start_urls in le_maxpage, so for the le_maxpage expression to work there has to be a class-level variable named start_urls at the time the class body is evaluated.
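
A stripped-down illustration of the scoping rule at play (nothing Scrapy-specific; the Works/Fails class names are made up for this example):

class Works(object): 
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]   # class-level: exists while the class body runs 
    pattern = r'%s+p\d+' % start_urls[0]                   # OK: start_urls is already defined here 

class Fails(object): 
    def __init__(self): 
        # instance-level: only exists once an object has been created 
        self.start_urls = ["http://www.funda.nl/koop/amsterdam/"] 

    # The class body runs at definition time, before __init__ ever does, 
    # so no start_urls is in scope there: 
    # pattern = r'%s+p\d+' % start_urls[0]   # NameError: name 'start_urls' is not defined 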

To fix this, you need to move your class-level variables to the instance level, that is, define them inside the __init__ block.
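
A minimal sketch of what that looks like, reusing the names from the question; the super() call is an addition here so that the base spider still gets initialized:

def __init__(self, place='amsterdam', *args, **kwargs): 
    super(NumberOfPagesSpider, self).__init__(*args, **kwargs) 
    # Both attributes now live on the instance, so they can depend on `place` 
    self.start_urls = ["http://www.funda.nl/koop/%s/" % place] 
    self.le_maxpage = LinkExtractor(allow=r'%s+p\d+' % self.start_urls[0]) 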


Hi masnun, I tried simply indenting the 'le_maxpage' and 'rules' lines so that they sit inside the '__init__' block, and with that the spider runs. However, it no longer generates the 'max_pages.txt' file like before. It looks like the 'callback' never fires? –


That is because your rules are not set now. You can change the callback name to 'parse' and it should work. – masnun


Following masnun's answer, I managed to solve the problem. For completeness, I list the updated code below.

import scrapy 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 

class NumberOfPagesSpider(CrawlSpider): 
    name = "number_of_pages" 
    allowed_domains = ["funda.nl"] 

    def __init__(self, place='amsterdam', *args, **kwargs): 
        super(NumberOfPagesSpider, self).__init__(*args, **kwargs) 
        self.start_urls = ["http://www.funda.nl/koop/%s/" % place] 
        self.le_maxpage = LinkExtractor(allow=r'%s+p\d+' % self.start_urls[0]) 
        self.rules = (Rule(self.le_maxpage),)   # no callback needed; parse() below handles every response 

    def parse(self, response): 
        links = self.le_maxpage.extract_links(response) 
        max_page_number = 0                                          # Initialize the maximum page number 
        for link in links: 
            if link.url.count('/') == 6 and link.url.endswith('/'):  # Select only pages at a path depth of 3 
                page_number = int(link.url.split("/")[-2].strip('p'))  # E.g. get 10 out of 'http://www.funda.nl/koop/amsterdam/p10/' 
                if page_number > max_page_number: 
                    max_page_number = page_number                    # Keep the largest page number seen so far 
        filename = "max_pages.txt"                                   # Name of the output file 
        with open(filename, 'w') as f: 
            f.write('max_page_number = %s' % max_page_number)        # Write the maximum page number to a text file 

Note that the Rule does not even need a callback, because parse is always called anyway.
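
The updated spider can then be run for any other place on funda.nl in the same way, for example (rotterdam is just an illustrative value):

scrapy crawl number_of_pages -a place=rotterdam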