2016-02-23 46 views

Scrapy does not change the proxy

I ran into a problem while trying to test proxies with Scrapy. I want the crawler to make requests to httpbin.org through each proxy:

import base64
from datetime import datetime

import scrapy

# get_connection, get_user_agent, DB_TABLE, PROXY_USER and
# PROXY_PASSWORD are project-local helpers and settings.


class CheckerSpider(scrapy.Spider):
    name = "checker"
    start_urls = (
        'https://www.httpbin.org/ip',
    )
    connection = get_connection()

    def start_requests(self):
        with self.connection.cursor() as cursor:
            # Select proxies that were last checked more than an hour ago
            # (or never checked at all).
            limit = int((datetime.now() - datetime(1970, 1, 1)).total_seconds()) - 3600
            q = """SELECT *
                   FROM {}
                   WHERE active = 1 AND (last_checked <= {} OR last_checked IS NULL);""".format(DB_TABLE, limit)
            cursor.execute(q)
            proxy_list = cursor.fetchall()

        for proxy in proxy_list[:15]:
            req = scrapy.Request(self.start_urls[0], self.check_proxy, dont_filter=True)
            req.meta['proxy'] = 'https://{}:8080'.format(proxy['ip'])
            req.meta['item'] = proxy
            user_pass = base64.encodestring('{}:{}'.format(PROXY_USER, PROXY_PASSWORD))
            req.headers['Proxy-Authorization'] = 'Basic {}'.format(user_pass)
            req.headers['User-Agent'] = get_user_agent()
            yield req

    def check_proxy(self, response):
        print response.request.meta['proxy']
        print response.meta['item']['ip']
        print response.body
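One Python detail worth spelling out for the `start_urls` declaration: a one-element tuple needs a trailing comma, so `('https://www.httpbin.org/ip')` without one is just a parenthesised string, not a tuple. A quick check (plain Python, nothing Scrapy-specific):

```python
# Without a trailing comma the parentheses are only grouping,
# so the value stays a plain string.
not_a_tuple = ('https://www.httpbin.org/ip')
real_tuple = ('https://www.httpbin.org/ip',)

print(type(not_a_tuple).__name__)  # str
print(type(real_tuple).__name__)   # tuple
```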

But when I test it, I see that Scrapy connects to the URL through only five of the proxies and then stops switching. Sample output (the IPs are obfuscated):

2016-02-23 14:54:36 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None) 
https://192.168.100.130:8080 
192.168.100.130 
{ 
    "origin": "192.168.100.130" 
} 

2016-02-23 14:54:36 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None) 
https://192.168.100.131:8080 
192.168.100.131 
{ 
    "origin": "192.168.100.131" 
} 
2016-02-23 14:54:37 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None) 
https://192.168.100.132:8080 
192.168.100.132 
{ 
    "origin": "192.168.100.132" 
} 

# Here Scrapy used wrong proxy to connect to site. 
2016-02-23 14:54:37 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None) 
https://192.168.100.134:8080 
192.168.100.134 
{ 
    "origin": "192.168.100.130" 
} 

Maybe I've made a mistake somewhere? Any ideas? Thanks.

UPD: Actually, I am now using a middleware to add the proxy to each request. I ordered it among the middlewares like this:

DOWNLOADER_MIDDLEWARES = { 
    'checker.middlewares.ProxyCheckMiddleware': 100, 
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110, 
} 

But I get the same result. Here is my custom middleware that adds the proxy:

import base64

# PROXY_USER and PROXY_PASSWORD are project-local settings.


class ProxyCheckMiddleware(object):

    def process_request(self, request, spider):
        if 'proxy' not in request.meta:
            request.meta['proxy'] = 'https://{}:8080'.format(request.meta['item']['ip'])
            request.meta['handle_httpstatus_list'] = [302, 503]
            user_pass = base64.encodestring('{}:{}'.format(PROXY_USER, PROXY_PASSWORD))
            request.headers['Proxy-Authorization'] = 'Basic {}'.format(user_pass)
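A side note on the header construction above: `base64.encodestring` is Python 2 only (it was deprecated and later removed in Python 3), and it appends a trailing newline that can end up inside the header value. On Python 3 the header could be built roughly like this (a sketch, with hypothetical helper and argument names):

```python
import base64

def proxy_auth_header(user, password):
    # b64encode works on bytes and, unlike the old encodestring,
    # does not append a trailing newline to the result.
    creds = '{}:{}'.format(user, password).encode('ascii')
    return 'Basic ' + base64.b64encode(creds).decode('ascii')

print(proxy_auth_header('user', 'pass'))  # Basic dXNlcjpwYXNz
```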

UPD: So far this looks like a bug in Scrapy. See the discussion here: https://github.com/scrapy/scrapy/issues/1807

Answers


Have you tried this?

Follow its setup instructions: create a text file with your proxy list in the specified format and run your requests through it. It randomizes which proxy is used and discards proxies that fail after a certain number of attempts. I can highly recommend it; I am currently using it with a proxy list from hidemyass.com.
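The rotation scheme this answer describes can be sketched as a small downloader middleware. This is a hypothetical illustration, not the linked module's actual code; the file format (one `https://host:port` proxy per line) and the `max_failures` threshold are assumptions:

```python
import random

class RandomProxyMiddleware(object):
    """Pick a random proxy per request; drop proxies that keep failing."""

    def __init__(self, proxies, max_failures=3):
        self.proxies = list(proxies)
        self.failures = {p: 0 for p in self.proxies}
        self.max_failures = max_failures

    @classmethod
    def from_file(cls, path):
        # One proxy URL per line, e.g. "https://10.0.0.1:8080".
        with open(path) as f:
            return cls(line.strip() for line in f if line.strip())

    def process_request(self, request, spider):
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get('proxy')
        if proxy in self.failures:
            self.failures[proxy] += 1
            if self.failures[proxy] >= self.max_failures and proxy in self.proxies:
                self.proxies.remove(proxy)  # discard a proxy that keeps failing
```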


That doesn't fit my case. I know how to write a middleware, but this is a bug in Scrapy (see the github link at the end of the question). It has already been fixed in Scrapy 1.1.0. – drjackild


Try this ProxyMiddleware in your middlewares.py file:

class ProxyMiddleware(object):

    def process_request(self, request, spider):
        request.meta['proxy'] = 'https://{}:8080'.format(request.meta.get('item').get('ip'))

        # If the proxy needs auth (you will also need to import base64):
        # proxy_auth = "username:password"
        # encoded_auth = base64.encodestring(proxy_auth)
        # request.headers['Proxy-Authorization'] = 'Basic ' + encoded_auth
        return request

And in your settings.py file:

DOWNLOADER_MIDDLEWARES = { 
    'checker.middlewares.ProxyMiddleware': 100, 
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110 
} 

Your example will produce an infinite loop: if a middleware returns a request, Scrapy re-schedules it and runs it through all the middlewares again. In fact, the problem has already been solved; see the github link at the end of the question. The bug was fixed in 1.1.0. – drjackild


@drjackild It won't create an infinite loop. You have to order your middlewares appropriately, and then it won't do what you say. It's a simple fix; I was just offering a suggestion. – cmeadows