2017-07-01 47 views

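Need help with YellowPages spider

I'm new to Scrapy, and so far I've been able to create a few spiders. I want to write a spider that crawls Yellowpages looking for websites that return a 404 response. The spider itself works fine, but the pagination doesn't. Any help would be greatly appreciated. Thanks in advance.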

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    #allowed_domains = ['www.yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):

            url = listing.css('a.track-visit-website::attr(href)').extract_first()

            yield scrapy.Request(url=url, callback=self.parse_details)

        # follow pagination links

        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        next_page_url = response.urljoin(next_page_url)
        if next_page_url:
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {'Response': response}

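Hi David, this is my first post here, and I had trouble formatting the code. My question is simple: I have a pagination problem with this spider, and I'm not sure what I'm missing. – oscarQ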

Answer


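I ran your code and found an error. In the first loop you don't check the value of url, and for listings without a website link it is None. Passing None to scrapy.Request raises an exception inside parse, so the callback stops before it ever reaches the pagination code. That is why it looks as if pagination isn't working.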
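
To see why one bad URL prevents the next page from being requested, here is a minimal sketch in plain Python with no Scrapy involved. The names make_request, parse, and collect are hypothetical stand-ins invented for this illustration, and the example.com URLs are made up. It only mimics the shape of the spider: a generator that yields listing requests first and the pagination request last.

```python
def make_request(url):
    """Hypothetical stand-in for scrapy.Request: rejects a missing URL."""
    if not isinstance(url, str):
        raise TypeError(f"Request url must be str, got {type(url).__name__}")
    return ("request", url)


def parse(listing_urls, next_page_url):
    """Mimics the spider's parse(): listing requests first, pagination last."""
    for url in listing_urls:
        yield make_request(url)  # raises on None, which ends the generator
    yield make_request(next_page_url)  # not reached once an exception is raised


def collect(generator):
    """Drain a generator, returning what it yielded and the error, if any."""
    results, error = [], None
    try:
        for item in generator:
            results.append(item)
    except TypeError as exc:
        error = exc
    return results, error


# Illustrative data: the second listing has no website link.
results, error = collect(
    parse(["http://a.example.com", None, "http://b.example.com"], "/page2")
)
```

Only the first listing is yielded here. The third listing and the pagination request are lost because the generator ends at the exception.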

Here is a working version:

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    #allowed_domains = ['www.yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):
            url = listing.css('a.track-visit-website::attr(href)').extract_first()
            # Some listings have no website link; skip them instead of passing None
            if url:
                yield scrapy.Request(url=url, callback=self.parse_details)

        # Follow the pagination link; check it before joining, because
        # urljoin() turns a missing link into the current page's URL
        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        if next_page_url:
            yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)

    def parse_details(self, response):
        yield {'Response': response}
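
One detail about the pagination check is worth spelling out. As far as I know, response.urljoin builds on the standard library's urljoin, which returns the base URL unchanged when the link is empty. A check made after joining is therefore always true, so the missing-link test belongs before the join. A minimal sketch of that ordering follows, using only the standard library. The helper name next_page_url and the "/search?page=2" link are assumptions made up for illustration, not taken from the site.

```python
from urllib.parse import urljoin


def next_page_url(base_url, href):
    """Return an absolute next-page URL, or None when there is no next link."""
    if not href:  # test the raw href first; joining would hide the missing link
        return None
    return urljoin(base_url, href)


# Illustrative values only
base = "https://www.yellowpages.com/search?search_terms=handyman"
last_page = next_page_url(base, None)               # no "next" link on the page
middle_page = next_page_url(base, "/search?page=2")  # relative link made absolute
```

With this ordering the last page yields no further request, so the spider doesn't have to rely on Scrapy's duplicate filter to discard a repeat of the current page.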

Thank you very much, you guys are awesome! – oscarQ


No problem. If this solved your issue, please don't hesitate to accept the answer. –