Scrapy spider returns only 200 responses after a few minutes

I ran into a problem with dynamic content while trying to scrape a website. I just added Splash to my Scrapy setup using Docker, following this post:

https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/

Unfortunately, I am still not capturing the content, possibly because it is loaded dynamically.

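When my code runs, it captures content at first. After scraping about 4000 pages, it returns only the following line for the next 6000 pages, which hold most of the data: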

[scrapy.core.engine] DEBUG: Crawled (200) <GET http://www...> (referer: None)

Here is my spider code:

import scrapy
from scrapy_splash import SplashRequest


class PeopleSpider(scrapy.Spider):
    name = "people"
    # One URL per phone number in the range (use xrange on Python 2).
    start_urls = [
        'http://www.canada411.ca/res/%s/' % page
        for page in range(5192080000, 5192090000)
    ]

    def start_requests(self):
        # Render each page through Splash, waiting 2 seconds for the JavaScript.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='render.html',
                                args={'wait': 2},
                                )

    def parse(self, response):
        # Yields nothing when the page has no div#contact block.
        for people in response.css('div#contact'):
            yield {
                'name': people.css('h1.vcard__name::text').extract_first().strip().title(),
                'address': people.css('div.vcard__address::text').extract_first().strip().split(',')[0].strip(),
                'city': people.css('div.vcard__address::text').extract_first().strip().split(',')[1].strip().split(' ')[0].strip(),
                'province': people.css('div.vcard__address::text').extract_first().strip().split(',')[1].strip().split(' ')[1].strip(),
                'postal code': people.css('div.vcard__address::text').extract_first().split(',')[2].strip().replace(' ', ''),
                'phone': people.css('span.vcard__label::text').extract_first().replace('(', '').replace(')', '').replace('-', '').replace(' ', ''),
            }
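
As a side note, the repeated split chains in parse raise an exception as soon as the address text is missing or has fewer than three comma-separated parts. A minimal sketch of a single helper, assuming the address always looks like "street, city province, postal code" (the function name `parse_address` and the sample address are hypothetical, not from the post):

```python
def parse_address(raw):
    """Split 'street, city province, postal code' into separate fields.

    Returns None when the text is missing or not in the expected shape.
    """
    if not raw:
        return None
    parts = [part.strip() for part in raw.split(',')]
    if len(parts) < 3:
        return None
    # Split on the last space so multi-word city names stay intact.
    city, _, province = parts[1].rpartition(' ')
    if not city:
        return None
    return {
        'address': parts[0],
        'city': city,
        'province': province,
        'postal code': parts[2].replace(' ', ''),
    }


# Hypothetical sample input.
print(parse_address(' 123 Main St, Guelph ON, N1H 1A1 '))
```

Unlike the original chains, this keeps a city such as "Thunder Bay" whole instead of taking only its first word, so the output can differ for multi-word cities.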

Source

2017-02-23 Maciek Semik

Maybe the site you are scraping has started showing a CAPTCHA – Umair

Interesting, any solution? –

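I can't post code or a full solution. I suggest that whenever you get no data, you save the response HTML to a file and then open that file in a browser to see why the name, address, etc. are missing from the page – Umair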

Whenever you get no data, save the response HTML to a file, then open that file in a browser to see why the name, address, etc. are missing from the page.

I suspect they are showing a CAPTCHA because of the consecutive requests coming from the same IP.

If they are showing a CAPTCHA, you can use a proxy service to avoid it.
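
A minimal sketch of rotating proxies with Splash, assuming the proxy is passed through the `proxy` argument of the `render.html` endpoint. The proxy URLs and the helper name `splash_args` are hypothetical placeholders, not something from the question:

```python
from itertools import cycle

# Hypothetical proxy URLs; replace with the ones from your proxy provider.
PROXIES = [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
]
_proxy_pool = cycle(PROXIES)


def splash_args(wait=2):
    """Build Splash arguments that route the render through the next proxy."""
    return {'wait': wait, 'proxy': next(_proxy_pool)}
```

In the spider this would replace the fixed dictionary, for example `SplashRequest(url, self.parse, endpoint='render.html', args=splash_args())`.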

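Also create a downloader middleware that checks each page for a CAPTCHA (the check fits in process_response, since the CAPTCHA only shows up in the response) and, if it finds one, scrapes that link again with the dont_filter=True parameter.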
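
A minimal sketch of such a middleware. It does the check in `process_response`, because the CAPTCHA is only visible once the page has come back. The marker strings, the class name, the `captcha_retries` meta key and the retry limit are all assumptions; the real markers have to be taken from a saved blocked page:

```python
# Assumed markers; inspect a saved blocked page to find the real ones.
CAPTCHA_MARKERS = (b'captcha', b'recaptcha')


def looks_like_captcha(body):
    """Return True if the raw response body contains a CAPTCHA marker."""
    lowered = body.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


class CaptchaRetryMiddleware(object):
    """Downloader middleware that re-schedules requests answered with a CAPTCHA."""

    max_retries = 3  # assumed limit so a blocked URL is not retried forever

    def process_response(self, request, response, spider):
        if not looks_like_captcha(response.body):
            return response
        retries = request.meta.get('captcha_retries', 0)
        if retries >= self.max_retries:
            # Give up and let the spider see the blocked page.
            return response
        # dont_filter=True lets the duplicate filter accept the same URL again.
        retry = request.replace(dont_filter=True)
        retry.meta['captcha_retries'] = retries + 1
        return retry
```

The class would still have to be enabled in the `DOWNLOADER_MIDDLEWARES` setting. How it interacts with the scrapy-splash middleware depends on their relative order, which this sketch does not cover.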

Edit

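You can write the response to a file with the code below. By the way, a quick Google search will turn up plenty of other ways to write to a file in Python.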

# response.body is bytes, so open the file in binary mode.
with open('response.html', 'wb') as the_file:
    the_file.write(response.body)
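
With thousands of pages, a single fixed filename is overwritten on every write. A minimal sketch that stores each page under its own name instead; the helper name `save_response_body` and the `failed_pages` directory are hypothetical:

```python
import hashlib
import os


def save_response_body(url, body, directory='failed_pages'):
    """Write raw response bytes to a file named after the URL's hash; return the path."""
    if not os.path.isdir(directory):
        os.makedirs(directory)
    # Hash the URL so the filename is unique and safe on any filesystem.
    name = hashlib.md5(url.encode('utf-8')).hexdigest() + '.html'
    path = os.path.join(directory, name)
    with open(path, 'wb') as the_file:
        the_file.write(body)
    return path
```

Inside parse, it could be called only for pages without data, for example `if not response.css('div#contact'): save_response_body(response.url, response.body)`.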

Source

2017-02-24 13:41:07 Umair

Can you tell me how to save the HTML to a file? –

@MaciekSemik see the updated answer – Umair
