2017-10-06 116 views
0

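scrapy.Request() callback does not work

There are many questions about this already, but most people hit the problem because of the `dont_filter` argument. I passed `dont_filter=True`, and my custom parse generators still do not work. My code is below. The third parser, `parse_spec`, is never called. `parse_models_follow_next_page` works fine when it is called from `parse()`, but it cannot call itself when it needs to go to the next page: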

import scrapy 
from gsmarena.items import PhoneItems 

class VendorSpider(scrapy.Spider): 
    custom_settings = { 
     'DOWNLOAD_DELAY': 1.5, 
     'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A', 
     'COOKIES_ENABLED': False 
    } 

    name = "gsmarena_spec" 

    allowed_domains = ["https://www.gsmarena.com/"] 

    start_urls = [ 
     "https://www.gsmarena.com/makers.php3" 
    ] 

    def parse(self, response): 
     # print("Existing settings: %s" % self.settings.attributes.items()) 
     length = len(response.xpath("//table//a").extract()) 
     for i in range(1, length): 
      brand = response.xpath(
       '(//table//a)[{}]/text()'.format(i)).extract()[0] 
      url = "https://www.gsmarena.com/" + \ 
       response.xpath("(//table//a)[{}]/@href".format(i)).extract()[0] 
      yield scrapy.Request(url, callback=self.parse_models_follow_next_page, meta={'brand': brand}, dont_filter=True) 

    def parse_models_follow_next_page(self, response): 
     brand = response.meta.get('brand') 
     length = len(response.xpath(
      "//div[@class='makers']/self::div//a").extract()) 
     for i in range(1, length): 
      url = "https://www.gsmarena.com/" + \ 
       response.xpath(
        "(//div[@class='makers']/self::div//a)[{}]/@href".format(i)).extract()[0] 
      model = response.xpath(
       "(//div[@class='makers']/self::div//a//span/text())[{}]".format(i)).extract()[0] 
      yield scrapy.Request(url, callback=self.parse_spec, meta={'brand': brand, 'model': model}, dont_filter=True) 
     is_next_page = response.xpath(
      "//a[@class=\"pages-next\"]/@href").extract() 

     if is_next_page: 
      next_page = "https://www.gsmarena.com/" + is_next_page[0] 
      yield scrapy.Request(next_page, callback=self.parse_models_follow_next_page, meta={'brand': brand}, dont_filter=True) 



    def parse_spec(self, response): 
     item = PhoneItems() 
     item['model'] = response.meta.get('model') 
     item['brand'] = response.meta.get('brand') 
     for spec_name, spec in zip(response.xpath('//table//td[1]').extract(), response.xpath('//table//td[2]').extract()): 
      item[spec_name] = spec 
     yield item 

Sorry for my poor English.

+0

It works fine on my side: '{"model": "45 Ti", ' - CtheSky

Answers

0

There are a few issues with your scraper.

allowed_domains = ["https://www.gsmarena.com/"] 

should be

allowed_domains = ["www.gsmarena.com"] 
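As far as I understand it, `allowed_domains` is compared against the host name of each request, so an entry that carries a scheme and a trailing slash never matches anything. A minimal sketch, using only the standard library, of deriving the domain from the start URL instead of typing it by hand. The helper name `domain_from_url` is made up for illustration and is not part of Scrapy:

```python
from urllib.parse import urlparse


def domain_from_url(url):
    """Return only the host name of a URL (no scheme, port, or path)."""
    return urlparse(url).hostname


start_urls = ["https://www.gsmarena.com/makers.php3"]

# Build the list from the start URLs so the two settings cannot drift apart.
allowed_domains = sorted({domain_from_url(url) for url in start_urls})
print(allowed_domains)
```

Inside the spider you would then assign this list to `allowed_domains` as a class attribute.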

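Next, you have not defined the errback_httpbin method in your class:

def errback_httpbin(self, failure): 
    # Scrapy passes a Failure object to errbacks, not a response 
    pass 

Next, the code below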

for spec_name, spec in zip(response.xpath('//table//td[1]').extract(), response.xpath('//table//td[2]').extract()): 

should be

for spec_name, spec in zip(response.xpath('//table//td[1]/text()').extract(), response.xpath('//table//td[2]/text()').extract()): 
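To show why `/text()` matters: without it, the selector returns the whole `<td>` element as markup, and that markup ends up as the item key. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors, so the calls are not the Scrapy API. The sample table is invented for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row spec table standing in for a real phone page.
SAMPLE = (
    "<table>"
    "<tr><td>CPU</td><td>Octa-core</td></tr>"
    "<tr><td>Display</td><td>5.5 inches</td></tr>"
    "</table>"
)


def raw_first_cells(markup):
    """What you get without text(): the serialized <td> elements."""
    table = ET.fromstring(markup)
    return [ET.tostring(td, encoding="unicode") for td in table.findall(".//td[1]")]


def spec_pairs(markup):
    """What you get with text(): plain labels paired with plain values."""
    table = ET.fromstring(markup)
    names = [td.text for td in table.findall(".//td[1]")]
    values = [td.text for td in table.findall(".//td[2]")]
    return dict(zip(names, values))


print(raw_first_cells(SAMPLE))
print(spec_pairs(SAMPLE))
```

The first function gives strings like `<td>CPU</td>`, which are poor field names; the second gives `CPU` and `Display`.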

This still has some issues, though.

Also, your code will take some time before the first item is yielded, because the scheduler picks URLs in the order in which they come in.
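To illustrate the ordering point, here is a toy model only. It is not Scrapy's scheduler, whose real ordering also depends on its queue settings and on concurrency. `ToyScheduler` and the URLs are made up. It shows a request queued behind many others waiting its turn, and a higher priority letting it jump ahead. Scrapy's `Request` does accept a `priority` argument for that purpose:

```python
import heapq
import itertools


class ToyScheduler:
    """Illustrative priority queue; NOT how Scrapy is implemented."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def enqueue(self, url, priority=0):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        # The counter keeps arrival order among requests of equal priority.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def drain(self):
        order = []
        while self._heap:
            order.append(heapq.heappop(self._heap)[2])
        return order


# Without priorities the spec page waits behind every brand page.
plain = ToyScheduler()
for brand in ("acer", "alcatel", "apple"):
    plain.enqueue("brand/" + brand)
plain.enqueue("spec/acer-one")
plain_order = plain.drain()

# With a higher priority the spec page is handled first.
prioritised = ToyScheduler()
for brand in ("acer", "alcatel", "apple"):
    prioritised.enqueue("brand/" + brand)
prioritised.enqueue("spec/acer-one", priority=1)
prioritised_order = prioritised.drain()

print(plain_order)
print(prioritised_order)
```

Combined with a `DOWNLOAD_DELAY` of 1.5 seconds, a long queue of brand pages ahead of the first spec page can easily add up to a noticeable wait.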

+0

I had already defined the errback_httpbin function and forgot to post it here, sorry. Thanks for your suggestions.

+0

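You are right, I just needed more patience. I ran my script again, and the third parser was called after a few dozen seconds. This is my first time using Scrapy, and I thought it would work sequentially. Thank you very much.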

0

I have made some changes to the code. It scrapes all the results except spec_name, which is not specified in an understandable way.

import scrapy
from lxml import html
from tutorial.items import PhoneItems


class VendorSpider(scrapy.Spider):
    custom_settings = {
        'DOWNLOAD_DELAY': 1.5,
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
                      'AppleWebKit/537.75.14 (KHTML, '
                      'like Gecko) Version/7.0.3 Safari/7046A194A',
        'COOKIES_ENABLED': False
    }
    name = "gsmarena_spec"
    # Domain only, without scheme or trailing slash
    allowed_domains = ["www.gsmarena.com"]
    start_urls = [
        "https://www.gsmarena.com/makers.php3"
    ]

    def parse(self, response):
        # print("Existing settings: %s" % self.settings.attributes.items())
        length = len(response.xpath("//table//a").extract())
        # XPath positions are 1-based, so include the last link
        for i in range(1, length + 1):
            brand = response.xpath(
                '(//table//a)[{}]/text()'.format(i)).extract()[0]
            url = "https://www.gsmarena.com/" + \
                response.xpath("(//table//a)[{}]/@href".format(i)).extract()[0]
            yield scrapy.Request(url,
                                 callback=self.parse_models_follow_next_page,
                                 meta={'brand': brand}, dont_filter=True)

    def parse_models_follow_next_page(self, response):
        brand = response.meta.get('brand')
        meta = response.meta
        doc = html.fromstring(response.body)
        single_obj = doc.xpath('.//div[@class="makers"]/ul//li')
        for obj in single_obj:
            href = obj.xpath('.//a/@href')[0]
            url = "https://www.gsmarena.com/" + href
            # Brand and model are both taken from the link target
            meta['brand'] = href.split('_')[0]
            meta['model'] = href
            yield scrapy.Request(url=url, callback=self.parse_spec,
                                 meta=meta, dont_filter=True)
        is_next_page = response.xpath(
            "//a[@class=\"pages-next\"]/@href").extract()
        if is_next_page:
            next_page = "https://www.gsmarena.com/" + is_next_page[0]
            yield scrapy.Request(next_page,
                                 callback=self.parse_models_follow_next_page,
                                 meta={'brand': brand}, dont_filter=True)

    def parse_spec(self, response):
        item = PhoneItems()
        meta = response.meta
        item['model'] = meta['model']
        item['brand'] = meta['brand']

        # Need to specify details about spec_name
        # for spec_name, spec in zip(response.xpath('//table//td[1]').extract(),
        #                            response.xpath('//table//td[2]').extract()):
        #     item[spec_name] = spec
        yield item
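On the commented-out spec_name part: a `scrapy.Item` only accepts keys that were declared as fields, so `item[spec_name] = spec` with arbitrary table labels raises a `KeyError`. One possible way around it is to collect all specs into a single dictionary and store that under one field. This is only a sketch: the helper names are made up, and it assumes you would add a `specs` field to `PhoneItems`, which the original item may not have:

```python
import re


def normalise_spec_name(raw_name):
    """Turn a label such as ' Display size ' into 'display_size'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", raw_name.strip().lower())
    return cleaned.strip("_")


def build_specs(names, values):
    """Pair spec labels with their values, skipping cells with no label."""
    specs = {}
    for raw_name, raw_value in zip(names, values):
        key = normalise_spec_name(raw_name)
        if key:
            specs[key] = raw_value.strip()
    return specs


# Invented sample data; in the spider these lists would come from the
# td[1]/text() and td[2]/text() selectors.
names = ["CPU", " Display size ", ""]
values = ["Octa-core", "5.5 inches", "ignored"]
specs = build_specs(names, values)
print(specs)
```

In `parse_spec` you could then write `item['specs'] = build_specs(names, values)` once the field exists.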
+0

Your code works great, thank you for your hard work. spec_name is used to record certain smartphone specifications, such as CPU, screen size, and so on.