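Trouble getting Scrapy to crawl documents deep within a site

I want my spider to scrape the "Followers" and "Following" counts for every person. At the moment it returns only 6 results out of thousands. How can I get the complete results?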

"items.py" contains:

import scrapy


class HouzzItem(scrapy.Item):
    Following = scrapy.Field()
    Follower = scrapy.Field()

The spider, named "houzzsp.py", contains:

# The scrapy.contrib.* import paths are deprecated; use the current modules.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class HouzzspSpider(CrawlSpider):
    name = "houzzsp"
    allowed_domains = ['www.houzz.com']
    start_urls = ['http://www.houzz.com/professionals']

    rules = [
        # Follow links found inside the sidebar items.
        Rule(LinkExtractor(restrict_xpaths='//li[@class="sidebar-item"]')),
        # Follow the "next" navigation button.
        Rule(LinkExtractor(restrict_xpaths='//a[@class="navigation-button next"]')),
        # Follow links inside the name-info blocks and parse the resulting pages.
        Rule(LinkExtractor(restrict_xpaths='//div[@class="name-info"]'),
             callback='parse_items'),
    ]

    def parse_items(self, response):
        page = response.xpath('//div[@class="follow-section profile-l-sidebar "]')
        for titles in page:
            following = titles.xpath('.//a[@class="following follow-box"]/span[@class="follow-count"]/text()').extract()
            followers = titles.xpath('.//a[@class="followers follow-box"]/span[@class="follow-count"]/text()').extract()
            yield {'Following': following, 'Follower': followers}

Edit: I made changes to the rules and it is now working as I expected.


2017-04-08 SIM

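When using Scrapy's LinkExtractor with the restrict_xpaths parameter, you don't need to specify the exact XPath of the URL to follow. From Scrapy's documentation:

restrict_xpaths (str or list) – is an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from.

So the idea is to specify sections, and LinkExtractor will only look inside those tags to find links.

In short, don't put the a tag inside restrict_xpaths (@href would be even worse), because LinkExtractor will find the a tags inside the XPath you specify.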


2017-04-08 23:48:01 eLRuLL

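Thanks for your response, eLRuLL. After removing "href" from the rules I get 6 results out of thousands. What can I do now to get all of them? – SIM

Don't set the 'a' tag; set a tag that contains it instead. – eLRuLL

Thanks for the reply, eLRuLL. It seems that a slight change to the rules I had set earlier has made it work. However, it runs so fast that it is hard to tell from the command prompt what is actually happening. The site has millions of records. By the way, if I force-close cmd, is it still possible to get the data in a CSV? – SIM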
