
I built a crawler with Scrapy and wrote a script that crawls many pages, but Scrapy does not process all of the pages while crawling.

Unfortunately, not every run crawls every page. Some runs return all the pages, while others return only 23 or 180 (the result differs per URL).

import scrapy

class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"
    start_urls = [
        'http://www.bukalapak.com/c/perawatan-kecantikan/perawatan-wajah?page=1&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93',
    ]

    def parse(self, response):
        for product_list in response.css("ul[class='products row-grid']"):
            for product in product_list.css('li'):
                yield {
                    'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
                    'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
                    'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
                    'kota': product.css('div[class=user-city] a::text').extract(),
                    'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
                }

        # next page
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

Is the site blocking my HTTP requests, or is there a mistake in my code?
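One way to rule out server-side blocking is to slow the crawl down and make retries visible in the log. A minimal sketch, assuming the same spider; these are standard Scrapy settings, but the values here are only illustrative:

import scrapy

class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"

    # Throttle requests and retry failures so a rate-limited page
    # shows up in the log instead of silently disappearing.
    custom_settings = {
        'DOWNLOAD_DELAY': 1.0,         # pause between requests
        'AUTOTHROTTLE_ENABLED': True,  # back off when the server slows down
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 5,              # retry transient 5xx/429 responses
    }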

After updating the code following Granitosaurus's answer, it still fails: the spider returns an empty array.

import scrapy


class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"
    start_urls = [
        'http://www.bukalapak.com/c/perawatan-kecantikan/perawatan-wajah?page=1&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93',
    ]

    def parse(self, response):
        products = response.css('article.product-display')
        for product in products:
            yield {
                'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
                'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
                'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
                'kota': product.css('div[class=user-city] a::text').extract(),
                'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
            }

        # next page
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        last_url = "/c/perawatan-kecantikan/perawatan-wajah?page=100&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93"
        if next_page_url is not None and next_page_url != last_url:
            yield scrapy.Request(response.urljoin(next_page_url), dont_filter=True)
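For reference, a minimal logging sketch (not part of the original spider) that records the HTTP status and product count for each page, which makes it obvious where extraction starts coming back empty:

    def parse(self, response):
        products = response.css('article.product-display')
        # An empty product list or a non-200 status points at the culprit.
        self.logger.info('url=%s status=%s products=%d',
                         response.url, response.status, len(products))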

Thank you.

Answer

Your product selector is a bit shaky. Try selecting the product article elements directly; the site makes that easy to do with a CSS selector:

products = response.css('article.product-display')
for product in products:
    yield {
        'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
        'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
        'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
        'kota': product.css('div[class=user-city] a::text').extract(),
        'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
    }

You can debug the response by inserting inspect_response:

def parse(self, response):
    products = response.css('article.product-display')
    if not products:
        from scrapy.shell import inspect_response
        inspect_response(response, self)
        # will open up a Python shell here where you can check the `response` object
        # try `view(response)` to open it up in your browser and such.
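Once that shell opens, a few illustrative checks (ordinary Scrapy shell commands; the selector is the one from the answer above):

response.status                                  # a non-200 status suggests blocking or throttling
len(response.css('article.product-display'))     # 0 means the selector found nothing
response.css('a.next_page').extract_first()      # is the next-page link even present?
view(response)                                   # render the received HTML in a browser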

It still cannot crawl all of the pages; it only gets as far as page 28. https://snag.gy/CpyAXP.jpg –


@RadenJohannesHeryoPriambodo Works for me. What happens on page 28? No products found? You can add a debugging breakpoint to see what is going on; see my edit. – Granitosaurus


I mean the crawler stops on page 28; pages 1-27 run fine. @Granitosaurus –
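If the crawl consistently dies at one specific page, the server may be answering with a non-200 status (for example 429) that Scrapy's HttpError middleware filters out before parse() ever sees it. A sketch, assuming the same spider, that lets such responses through so they can be logged:

import scrapy

class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"
    # Deliver these error statuses to parse() instead of dropping them.
    handle_httpstatus_list = [403, 429, 503]

    def parse(self, response):
        if response.status != 200:
            self.logger.warning('Got %s on %s - possibly throttled',
                                response.status, response.url)
            return
        # ... normal extraction continues here ...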