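Checking URLs for 404 errors in Scrapy

I am crawling a set of pages. I don't know how many there are, but the current page is indicated by a simple number in the URL (e.g. "http://www.website.com/page/1").

I would like to use a for loop in Scrapy to increment the current page guess and stop when it hits a 404. I know the response returned from a request contains this information, but I'm not sure how to automatically get a response from a request.

Any ideas on how to do this?

Currently my code is something along the lines of: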

def start_requests(self):
    baseUrl = "http://website.com/page/"
    currentPage = 0
    stillExists = True
    while stillExists:
        currentUrl = baseUrl + str(currentPage)
        test = Request(currentUrl)
        if test.response.status != 404:  # This is what I'm not sure of
            yield test
            currentPage += 1
        else:
            stillExists = False

Source

2013-04-07 Slater Victoroff

You can do something like this:

from __future__ import print_function
import urllib2

baseURL = "http://www.website.com/page/"

for n in xrange(100):
    fullURL = baseURL + str(n)
    try:
        resp = urllib2.urlopen(fullURL)
    except urllib2.HTTPError as e:
        # urlopen raises HTTPError for 4xx/5xx responses, so a 404 ends up here.
        if e.code == 404:
            # Do whatever you want if a 404 is found.
            print("404 Found!")
        else:
            print("URL: {0} Response: {1}".format(fullURL, e.code))
    except urllib2.URLError:
        # No HTTP response at all (DNS failure, refused connection, ...).
        print("Could not connect to URL: {0}".format(fullURL))
    else:
        # Do your normal stuff here if the page is found.
        print("URL: {0} Response: {1}".format(fullURL, resp.getcode()))

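This iterates through the range and attempts to connect to each URL via urllib2. I don't know Scrapy or how your example function opens the URL, but this is an example of how to do it via urllib2.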
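For readers on Python 3, the same idea might look roughly like the sketch below. It stops at the first 404 instead of scanning a fixed range. The names `http_status`, `count_pages`, `fetch_status` and `max_pages` are made up for this illustration and are not part of the original answer; the status lookup is passed in as a parameter only so the loop can be tried without touching the network.

```python
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status code for url, including 4xx/5xx codes."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.getcode()
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for error statuses; the code is on the exception.
        # Connection failures (urllib.error.URLError) are left to propagate.
        return err.code


def count_pages(base_url, fetch_status=http_status, max_pages=100):
    """Probe base_url + '0', base_url + '1', ... and stop at the first 404.

    Returns how many consecutive pages answered with something other
    than 404, capped at max_pages (a hypothetical safety limit).
    """
    for n in range(max_pages):
        if fetch_status(base_url + str(n)) == 404:
            return n
    return max_pages
```

Calling `count_pages("http://www.website.com/page/")` would issue real requests one after another; passing a stub as `fetch_status` exercises the loop offline.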

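Note that most sites using this kind of URL format are running a CMS that can automatically redirect non-existent pages to a custom "404 - Not Found" page, which will still come back with an HTTP status code of 200. In that case, the best way to spot a page that appears to exist but is really just the custom 404 page is to do some screen scraping and look for anything that would not appear on a "normal" page, such as the text "Page not found" or something similar that is unique to the custom 404 page.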
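A minimal sketch of that content check, assuming the custom error page contains a recognisable phrase. The function name `is_not_found` and the default marker "page not found" are placeholders for illustration; the real phrase depends on the site being crawled.

```python
def is_not_found(status, body, markers=("page not found",)):
    """Return True for a real 404, or for a page whose body contains a marker phrase.

    markers holds phrases assumed to appear only on the site's custom
    error page; matching is case-insensitive.
    """
    if status == 404:
        return True
    text = body.lower()
    return any(marker.lower() in text for marker in markers)
```

A plain substring test like this is cheap, but it can misfire if a regular page happens to contain the same phrase, so the markers should be as specific to the error page as possible.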

Source

2013-04-08 00:17:27 serk

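In my experience, most custom 404 pages do return a 404 status code. – Taymon 2013-04-08 02:11:09

It turns out they don't, and I can't really solve this without checking their content, which is too slow, but this answer will generally solve the problem. – 2013-04-08 03:38:00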

You need to yield/return the request in order to check the status; creating a Request object does not actually send it.

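import scrapy
from scrapy import Request


class MySpider(scrapy.Spider):  # named BaseSpider in older Scrapy releases
    name = 'website.com'
    baseUrl = "http://website.com/page/"
    # By default Scrapy drops non-2xx responses before they reach the callback;
    # listing 404 here lets parse() see them.
    handle_httpstatus_list = [404]

    def start_requests(self):
        yield Request(self.baseUrl + '0')

    def parse(self, response):
        if response.status != 404:
            # Carry the page number in meta and request the next page.
            page = response.meta.get('page', 0) + 1
            return Request('%s%s' % (self.baseUrl, page), meta=dict(page=page))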

Source

2013-04-08 02:13:22
