Scrapy: scraping the "next" results of a web page with Scrapy

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request


class InfoSpider(scrapy.Spider):

    name = 'info'
    allowed_domains = ['womenonlyconnected.com']
    start_urls = ['http://www.womenonlyconnected.com/socialengine/pageitems/index']

    def parse(self, response):
        urls = response.xpath('//h3/a/@href').extract()
        for url in urls:
            absolute_url = response.urljoin(url)
            yield Request(absolute_url, callback=self.parse_page)

    def parse_page(self, response):
        pass

Here is my code. With it I can scrape only the first 24 links. I need help scraping all the links that appear after clicking "View More". The page URL is below:

http://www.womenonlyconnected.com/socialengine/pageitems/index

Source

2017-09-27 Haider Ali

After a little investigation, you can find that pagination works through this URL:

http://www.womenonlyconnected.com/socialengine/pageitems/index?page=N

where N starts at 1 for the first page, and so on. So I would modify your spider like this:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request


class InfoSpider(scrapy.Spider):

    name = 'info'
    allowed_domains = ['womenonlyconnected.com']
    start_urls = ['http://www.womenonlyconnected.com/socialengine/pageitems/index']
    page_num = 1  # number of the listing page currently being parsed

    def parse(self, response):
        # Follow every item link found on the current listing page.
        urls = response.xpath('//h3/a/@href').extract()
        for url in urls:
            absolute_url = response.urljoin(url)
            yield Request(absolute_url, callback=self.parse_page)

        # Request the next listing page, up to a hard limit of 100 pages.
        if self.page_num < 100:
            self.page_num += 1
            next_url = self.start_urls[0] + '?page={}'.format(self.page_num)
            yield Request(next_url, callback=self.parse)

    def parse_page(self, response):
        pass

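The reason I stop at page 100 is that it is not easy to determine whether there are more results, and therefore whether you should go on to the next page. In theory, you could check whether the "View More" element exists on the page. The problem is that it is always present, and it is only hidden when there are no more pages with results. But the hiding happens in JavaScript, so Scrapy, which does not run JavaScript, always sees the element as visible. To reliably tell whether there are more pages, you would have to render the page using, for example, Splash.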
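One way to avoid relying only on a fixed page limit, without rendering JavaScript, is to stop as soon as a listing page yields no item links. The sketch below is a minimal illustration of that idea, not something confirmed for this site: it assumes that a page number past the last page returns a listing with no `//h3/a` links. The helper names `page_url` and `next_page_url` and the `max_pages` parameter are hypothetical, and the cap of 100 is kept as a safety net in case the assumption is wrong.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

BASE_URL = 'http://www.womenonlyconnected.com/socialengine/pageitems/index'


def page_url(base_url, page_num):
    """Return base_url with its `page` query parameter set to page_num."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    query['page'] = str(page_num)
    return urlunsplit(parts._replace(query=urlencode(query)))


def next_page_url(base_url, current_page, links_found, max_pages=100):
    """Return the URL of the next listing page, or None to stop crawling.

    Assumption (not verified against the site): a page past the last one
    contains no item links. max_pages is a hard cap that still applies
    if that assumption does not hold.
    """
    if links_found == 0 or current_page >= max_pages:
        return None
    return page_url(base_url, current_page + 1)


# Hypothetical use inside InfoSpider.parse, after yielding the item requests:
#     next_url = next_page_url(self.start_urls[0], self.page_num, len(urls))
#     if next_url is not None:
#         self.page_num += 1
#         yield Request(next_url, callback=self.parse)
```

If the site instead repeats the last page for out-of-range page numbers, the empty-page check never fires and only the cap stops the crawl, so the cap is worth keeping either way.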

Source

2017-09-27 05:45:28

@TomášLinhart, I checked the site as well. This is the last page: 'http://www.womenonlyconnected.com/socialengine/pageitems/index?page=47'. – SIM

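@Shahin That is not a truly general solution, because the number of pages may change over time as articles are added. A truly general solution involves using a (headless) browser to render the page, e.g. Splash. –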
