Scrapy stops scraping at a seemingly random point

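I am scraping ads for London housing from this site.

Housing ads can be searched at 3 different area sizes: all of London, a specific area (e.g. Central London), or a specific sub-area (e.g. Aldgate).

The site only lets you view 50 pages of 30 ads each per area, regardless of the area's size. That is, if I select X, I can view 1500 ads in X, whether X is Central London or Aldgate.

At the time of writing this question, there are more than 37000 ads on the site.

Since I want to scrape as many ads as possible, this limit means I need to scrape ads at the sub-area level.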
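Put as rough arithmetic (a sketch only; the variable names are made up for illustration, and 37000 is just the lower bound quoted above):

```python
import math

PAGES_PER_SEARCH = 50   # pages the site shows per area, from the question
ADS_PER_PAGE = 30       # ads per page, from the question
TOTAL_ADS = 37000       # "more than 37000", so treated as a lower bound

# Ads reachable through a single area search.
ads_per_search = PAGES_PER_SEARCH * ADS_PER_PAGE

# Fewest searches needed if the ads were split perfectly evenly
# across areas (an optimistic assumption).
min_searches = math.ceil(TOTAL_ADS / ads_per_search)

# Share of all ads visible when searching "all of London" only.
coverage = ads_per_search / TOTAL_ADS

print(f"{ads_per_search} ads per search, "
      f"at least {min_searches} searches, "
      f"{coverage:.1%} coverage from one search")
```

So a single London-wide search can reach only a small fraction of the ads, which is why the spider has to go down to sub-areas.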

To do this, I wrote the following spider:

import scrapy

# xpath to area/sub area links
area_links = ('''//*[@id="fullListings"]/div[1]/div/div/nav/aside/'''
              '''section[1]/div/ul/li/a/@href''')


class ApartmentSpider(scrapy.Spider):
    name = 'apartments2'
    start_urls = [
        "https://www.gumtree.com/property-to-rent/london"
    ]

    # obtain links to london areas
    def parse(self, response):
        for url in response.xpath(area_links).extract():
            yield scrapy.Request(response.urljoin(url),
                                 callback=self.parse_sub_area)

    # obtain links to london sub areas
    def parse_sub_area(self, response):
        for url in response.xpath(area_links).extract():
            yield scrapy.Request(response.urljoin(url),
                                 callback=self.parse_ad_overview)

    # obtain ads per sub area page
    def parse_ad_overview(self, response):
        for ads in response.xpath(
                '//*[@id="srp-results"]/div[1]/div/div[2]'
        ).css('ul').css('li').css('a').xpath('@href').extract():
            yield scrapy.Request(response.urljoin(ads),
                                 callback=self.parse_ad)

        # follow the link to the next results page, if there is one
        next_page = response.css(
            '#srp-results > div.grid-row > div > ul > li.pagination-next > a'
        ).xpath('@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    # obtain info per ad
    def parse_ad(self, response):
        # here follows code to extract data per ad
        pass

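This works fine.

That is, it obtains the links to the areas from the initial page, the links to the sub-areas from their respective area pages, and the links to the housing ads from each sub-area page, iterating over all pages of each sub-area. Finally, it scrapes the data from each ad.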

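Problem

The code stops scraping at a seemingly random point, and I don't know why.

I suspect it has hit some limit, since it is told to scrape a large number of links and items, but I am not sure whether that is right.

When it stops, it reports: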

{'downloader/request_bytes': 1295950, 
'downloader/request_count': 972, 
'downloader/request_method_count/GET': 972, 
'downloader/response_bytes': 61697740, 
'downloader/response_count': 972, 
'downloader/response_status_count/200': 972, 
'dupefilter/filtered': 1806, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 9, 4, 17, 13, 35, 53156), 
'item_scraped_count': 865, 
'log_count/DEBUG': 1839, 
'log_count/ERROR': 5, 
'log_count/INFO': 11, 
'request_depth_max': 2, 
'response_received_count': 972, 
'scheduler/dequeued': 971, 
'scheduler/dequeued/memory': 971, 
'scheduler/enqueued': 971, 
'scheduler/enqueued/memory': 971, 
'spider_exceptions/TypeError': 5, 
'start_time': datetime.datetime(2017, 9, 4, 17, 9, 56, 132388)}

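I don't know whether one can tell from this output if I have hit a limit or something similar, but if anyone can, please let me know whether I did and how to prevent the code from stopping.

Source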

2017-09-04 LucSpan

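You are only getting status 200 responses. If something had really gone wrong, or if you had been blocked, you would be getting Service Unavailable (503) responses or something similar. Do you think the code stops too early because the number of items varies between runs? – Andras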
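To make that reading of the stats concrete, here is a small sketch that pulls a few counters out of the dump in the question. `summarize_stats` is a hypothetical helper written for this illustration, not part of Scrapy, and the interpretation rests on my understanding that Scrapy reports `'finished'` when the crawl simply ran out of requests, while limits set through the CloseSpider extension show up as reasons such as `closespider_itemcount` or `closespider_pagecount`.

```python
# A few counters copied from the stats dump in the question.
stats = {
    'downloader/response_count': 972,
    'downloader/response_status_count/200': 972,
    'dupefilter/filtered': 1806,
    'finish_reason': 'finished',
    'item_scraped_count': 865,
    'spider_exceptions/TypeError': 5,
}


def summarize_stats(stats):
    """Hypothetical helper: derive a few diagnostics from a Scrapy stats dict."""
    total = stats.get('downloader/response_count', 0)
    ok = stats.get('downloader/response_status_count/200', 0)
    # Sum every exception counter raised inside spider callbacks.
    callback_errors = sum(
        count for key, count in stats.items()
        if key.startswith('spider_exceptions/')
    )
    return {
        'non_200_responses': total - ok,
        'closed_normally': stats.get('finish_reason') == 'finished',
        'duplicate_requests_dropped': stats.get('dupefilter/filtered', 0),
        'callback_errors': callback_errors,
    }


summary = summarize_stats(stats)
for name, value in summary.items():
    print(f"{name}: {value}")
```

On these numbers nothing came back with a non-200 status and the crawl closed with the normal reason, while 1806 requests were dropped as duplicates and 5 callbacks raised a TypeError. That suggests looking at the spider's own link-following logic rather than at a limit imposed by the site.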

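Hi Andras, I'm afraid I don't understand what you mean by 'the number of items varies between runs'. – LucSpan

Why do you think your code stops scraping early? – Andras

Although the full log of the crawl, or at least part of it, would help with troubleshooting, I'll take the risk and post this answer, because I see one thing that I assume is the problem: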

def parse_ad_overview(self, response):
    for ads in response.xpath(
            '//*[@id="srp-results"]/div[1]/div/div[2]'
    ).css('ul').css('li').css('a').xpath('@href').extract():
        yield scrapy.Request(response.urljoin(ads),
                             callback=self.parse_ad)

    next_page = response.css(
        '#srp-results > div.grid-row > div > ul > li.pagination-next > a'
    ).xpath('@href').extract_first()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)

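I'm pretty sure I know what is happening, having run into similar problems in the past. Looking at your script, the request for the next page that you yield from this last function has a callback that sends the response back to parse, even though I assume that response is another ad overview page with its own ads and its own next-page link. So just change the callback to parse_ad_overview.

Source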

2017-09-05 00:00:26 scriptso
