
I built a crawler with Scrapy and wrote a script that crawls many pages, but Scrapy does not process all of the pages while crawling.

Unfortunately, not every run crawls every page. Some runs return all the pages, while others return only 23 or 180 (the result differs per URL).

import scrapy

class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"
    start_urls = [
        'http://www.bukalapak.com/c/perawatan-kecantikan/perawatan-wajah?page=1&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93',
    ]

    def parse(self, response):
        for product_list in response.css("ul[class='products row-grid']"):
            for product in product_list.css('li'):
                yield {
                    'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
                    'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
                    'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
                    'kota': product.css('div[class=user-city] a::text').extract(),
                    'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
                }

        # next page
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

Is the site blocking my HTTP requests, or is there a mistake in my code?
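One way to rule out server-side blocking is to slow the crawl down and make retries visible in the log. A minimal sketch, assuming the same spider; these are standard Scrapy settings, but the values here are only illustrative:

import scrapy

class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"

    # Throttle requests and retry failures so a rate-limited page
    # shows up in the log instead of silently disappearing.
    custom_settings = {
        'DOWNLOAD_DELAY': 1.0,         # pause between requests
        'AUTOTHROTTLE_ENABLED': True,  # back off when the server slows down
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 5,              # retry transient 5xx/429 responses
    }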

After updating the code following Granitosaurus's answer, it still fails: the spider returns an empty array.

import scrapy


class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"
    start_urls = [
        'http://www.bukalapak.com/c/perawatan-kecantikan/perawatan-wajah?page=1&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93',
    ]

    def parse(self, response):
        products = response.css('article.product-display')
        for product in products:
            yield {
                'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
                'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
                'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
                'kota': product.css('div[class=user-city] a::text').extract(),
                'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
            }

        # next page
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        last_url = "/c/perawatan-kecantikan/perawatan-wajah?page=100&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93"
        if next_page_url is not None and next_page_url != last_url:
            yield scrapy.Request(response.urljoin(next_page_url), dont_filter=True)
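For reference, a minimal logging sketch (not part of the original spider) that records the HTTP status and product count for each page, which makes it obvious where extraction starts coming back empty:

    def parse(self, response):
        products = response.css('article.product-display')
        # An empty product list or a non-200 status points at the culprit.
        self.logger.info('url=%s status=%s products=%d',
                         response.url, response.status, len(products))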

Thank you.

Answer

Your product selector is a bit shaky. Try selecting the product article elements directly; the site makes that easy to do with a CSS selector:

products = response.css('article.product-display')
for product in products:
    yield {
        'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
        'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
        'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
        'kota': product.css('div[class=user-city] a::text').extract(),
        'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
    }

You can debug the response by inserting inspect_response:

def parse(self, response):
    products = response.css('article.product-display')
    if not products:
        from scrapy.shell import inspect_response
        inspect_response(response, self)
        # will open up a Python shell here where you can check the `response` object
        # try `view(response)` to open it up in your browser and such.
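Once that shell opens, a few illustrative checks (ordinary Scrapy shell commands; the selector is the one from the answer above):

response.status                                  # a non-200 status suggests blocking or throttling
len(response.css('article.product-display'))     # 0 means the selector found nothing
response.css('a.next_page').extract_first()      # is the next-page link even present?
view(response)                                   # render the received HTML in a browser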

It still cannot crawl all of the pages; it only gets as far as page 28. https://snag.gy/CpyAXP.jpg –


@RadenJohannesHeryoPriambodo Works for me. What happens on page 28? No products found? You can add a debugging breakpoint to see what is going on; see my edit. – Granitosaurus


I mean the crawler stops on page 28; pages 1-27 run fine. @Granitosaurus –
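If the crawl consistently dies at one specific page, the server may be answering with a non-200 status (for example 429) that Scrapy's HttpError middleware filters out before parse() ever sees it. A sketch, assuming the same spider, that lets such responses through so they can be logged:

import scrapy

class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"
    # Deliver these error statuses to parse() instead of dropping them.
    handle_httpstatus_list = [403, 429, 503]

    def parse(self, response):
        if response.status != 200:
            self.logger.warning('Got %s on %s - possibly throttled',
                                response.status, response.url)
            return
        # ... normal extraction continues here ...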