
I wrote a spider whose sole purpose is to extract a single number from http://www.funda.nl/koop/amsterdam/, namely the maximum page number shown in the pager at the bottom of the page (for example, the number 255 in the screenshot below). However, the Scrapy feed output contains the desired output several times instead of just once.

(screenshot of the pager at the bottom of the listings page)

I managed to do this with a LinkExtractor based on a regular expression that the URLs of these pages match. The spider is shown below:

import scrapy 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 
from scrapy.crawler import CrawlerProcess 
from Funda.items import MaxPageItem 

class FundaMaxPagesSpider(CrawlSpider): 
    name = "Funda_max_pages" 
    allowed_domains = ["funda.nl"] 
    start_urls = ["http://www.funda.nl/koop/amsterdam/"] 

    # Link to a page containing thumbnails of several houses, such as
    # http://www.funda.nl/koop/amsterdam/p10/
    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0]) 

    rules = (
        Rule(le_maxpage, callback='get_max_page_number'), 
    ) 

    def get_max_page_number(self, response): 
        links = self.le_maxpage.extract_links(response) 
        page_numbers = [] 
        for link in links: 
            if link.url.count('/') == 6 and link.url.endswith('/'):    # Select only pages with a link depth of 3 
                # For example, get the number 10 out of the string 'http://www.funda.nl/koop/amsterdam/p10/'
                page_number = int(link.url.split("/")[-2].strip('p')) 
                page_numbers.append(page_number) 
        max_page_number = max(page_numbers) 
        print("The maximum page number is %s" % max_page_number) 
        yield {'max_page_number': max_page_number} 

If I run this with feed output by entering scrapy crawl Funda_max_pages -o funda_max_pages.json on the command line, the resulting JSON file looks like this:

[ 
{"max_page_number": 257}, 
{"max_page_number": 257}, 
{"max_page_number": 257}, 
{"max_page_number": 257}, 
{"max_page_number": 257}, 
{"max_page_number": 257}, 
{"max_page_number": 257} 
] 

What I find strange is that the dict is output 7 times instead of once. After all, the yield statement is not inside the for loop. Can anyone explain this behaviour?

Answers

1. Your spider first goes to the start_url.
2. It uses the LinkExtractor to extract 7 URLs from that page.
3. Each of these 7 URLs is downloaded, and get_max_page_number is called on each of them.
4. For each of these URLs, get_max_page_number yields one dict, which is why the item appears 7 times in the feed (a sketch of how to yield it only once follows below).
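
For reference, here is a minimal sketch of one way to get the item only once: reuse the question's LinkExtractor and page-number parsing, but in a plain scrapy.Spider whose parse callback is invoked only for the start URL. The spider name is made up for this example, and this is not part of the original answer:

import scrapy 
from scrapy.linkextractors import LinkExtractor 

class FundaMaxPageOnceSpider(scrapy.Spider): 
    name = "Funda_max_page_once"    # hypothetical name for this sketch 
    allowed_domains = ["funda.nl"] 
    start_urls = ["http://www.funda.nl/koop/amsterdam/"] 

    # Same pager-link pattern as in the question, e.g. http://www.funda.nl/koop/amsterdam/p10/
    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0]) 

    def parse(self, response): 
        # parse() runs once, for the start URL only, so exactly one item is yielded
        page_numbers = [ 
            int(link.url.split("/")[-2].strip('p')) 
            for link in self.le_maxpage.extract_links(response) 
            if link.url.count('/') == 6 and link.url.endswith('/') 
        ] 
        yield {'max_page_number': max(page_numbers)} 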

As a workaround, I write the output to a text file instead of relying on the JSON feed output:

import scrapy 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 
from scrapy.crawler import CrawlerProcess 

class FundaMaxPagesSpider(CrawlSpider): 
    name = "Funda_max_pages" 
    allowed_domains = ["funda.nl"] 
    start_urls = ["http://www.funda.nl/koop/amsterdam/"] 

    # Link to a page containing thumbnails of several houses, such as
    # http://www.funda.nl/koop/amsterdam/p10/
    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0]) 

    rules = (
        Rule(le_maxpage, callback='get_max_page_number'), 
    ) 

    def get_max_page_number(self, response): 
        links = self.le_maxpage.extract_links(response) 
        max_page_number = 0                                 # Initialize the maximum page number 
        for link in links: 
            if link.url.count('/') == 6 and link.url.endswith('/'):    # Select only pages with a link depth of 3 
                print("The link is %s" % link.url) 
                # For example, get the number 10 out of the string 'http://www.funda.nl/koop/amsterdam/p10/'
                page_number = int(link.url.split("/")[-2].strip('p')) 
                if page_number > max_page_number: 
                    max_page_number = page_number           # Update the maximum page number if the current value is larger 
        print("The maximum page number is %s" % max_page_number) 
        place_name = link.url.split("/")[-3]                # For example, "amsterdam" in 'http://www.funda.nl/koop/amsterdam/p10/' 
        print("The place name is %s" % place_name) 
        filename = str(place_name) + "_max_pages.txt"       # File name prefixed with the place name 
        with open(filename, 'w') as f:                      # Text mode, so a plain str can be written 
            f.write('max_page_number = %s' % max_page_number)   # Write the maximum page number to a text file 
        yield {'max_page_number': max_page_number} 

process = CrawlerProcess({ 
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)' 
}) 

process.crawl(FundaMaxPagesSpider) 
process.start()     # the script will block here until the crawling is finished 

I also adapted the spider so that it can be run as a script. The script produces a text file amsterdam_max_pages.txt containing the single line max_page_number = 257.


You are still crawling 7 URLs, but you just overwrite the same file 7 times with 'max_page_number: 257'... – Granitosaurus
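
A minimal sketch of one way to address the comment, assuming the CrawlSpider rules are kept: accumulate a running maximum across the 7 callbacks and write the file only once, in the spider's closed() hook, which Scrapy calls when the crawl finishes. The 7 pager pages are still downloaded, but the file is written a single time; the class name, spider name, and output file name below are made up for the example:

from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 

class FundaMaxPagesSingleWriteSpider(CrawlSpider): 
    name = "Funda_max_pages_single_write"   # hypothetical name for this sketch 
    allowed_domains = ["funda.nl"] 
    start_urls = ["http://www.funda.nl/koop/amsterdam/"] 

    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0]) 
    rules = (
        Rule(le_maxpage, callback='get_max_page_number'), 
    ) 

    max_page_number = 0     # running maximum, updated by each of the 7 callbacks 

    def get_max_page_number(self, response): 
        for link in self.le_maxpage.extract_links(response): 
            if link.url.count('/') == 6 and link.url.endswith('/'): 
                page_number = int(link.url.split("/")[-2].strip('p')) 
                self.max_page_number = max(self.max_page_number, page_number) 

    def closed(self, reason): 
        # Called by Scrapy once, when the spider finishes, so the file is written only once
        with open("amsterdam_max_pages.txt", 'w') as f: 
            f.write('max_page_number = %s' % self.max_page_number) 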