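Scrapy does not change the proxy
I ran into a problem while trying to test proxies with Scrapy. I want to check each proxy against httpbin.org, so I wrote this spider: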
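# Python 2
import base64
from datetime import datetime

import scrapy

# get_connection, get_random_word, get_user_agent, DB_TABLE, PROXY_USER and
# PROXY_PASSWORD come from the rest of the project (not shown).


class CheckerSpider(scrapy.Spider):
    name = "checker"
    # A single URL string; the parentheses do not make this a tuple.
    start_urls = (
        'https://www.httpbin.org/ip'
    )
    connection = get_connection()

    def start_requests(self):
        with self.connection.cursor() as cursor:
            # Seconds since 1970-01-01 (local time), minus one hour.
            limit = int((datetime.now() - datetime(1970, 1, 1)).total_seconds()) - 3600
            q = """ SELECT *
                    FROM {}
                    WHERE active = 1 AND last_checked <= {} OR last_checked IS NULL;""".format(DB_TABLE, limit)
            cursor.execute(q)
            proxy_list = cursor.fetchall()
        # Send one request per proxy, each through its own proxy.
        for proxy in proxy_list[:15]:
            word = get_random_word()
            req = scrapy.Request(self.start_urls, self.check_proxy, dont_filter=True)
            req.meta['proxy'] = 'https://{}:8080'.format(proxy['ip'])
            req.meta['item'] = proxy
            # b64encode, unlike encodestring, does not append a newline.
            user_pass = base64.b64encode('{}:{}'.format(PROXY_USER, PROXY_PASSWORD))
            req.headers['Proxy-Authorization'] = 'Basic {}'.format(user_pass)
            req.headers['User-Agent'] = get_user_agent()
            yield req

    def check_proxy(self, response):
        # Proxy we asked for, the proxy's IP from the database, and the IP httpbin saw.
        print response.request.meta['proxy']
        print response.meta['item']['ip']
        print response.body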
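A side note on the query in the spider above: SQL evaluates AND before OR, so the WHERE clause as written also returns inactive rows whose last_checked is NULL. If the intent was "active, and either stale or never checked", the OR part needs parentheses. The sketch below illustrates this with the standard-library sqlite3 module standing in for the project's get_connection(); the table name proxies, its columns, and the helper names select_stale_proxies and demo are assumptions made for illustration. The cutoff is passed as a bound parameter instead of being formatted into the string.

```python
import sqlite3
import time


def select_stale_proxies(conn, cutoff):
    """Return IPs of active proxies last checked at or before `cutoff`, or never checked."""
    query = (
        "SELECT ip FROM proxies "
        "WHERE active = 1 AND (last_checked <= ? OR last_checked IS NULL) "
        "ORDER BY ip"
    )
    return [row[0] for row in conn.execute(query, (cutoff,))]


def demo():
    # Hypothetical in-memory table mirroring the columns used in the question.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE proxies (ip TEXT, active INTEGER, last_checked INTEGER)")
    now = int(time.time())
    conn.executemany(
        "INSERT INTO proxies VALUES (?, ?, ?)",
        [
            ("192.168.100.130", 1, now - 7200),  # active, stale: selected
            ("192.168.100.131", 1, now),         # active, fresh: skipped
            ("192.168.100.132", 1, None),        # active, never checked: selected
            ("192.168.100.133", 0, None),        # inactive: skipped
        ],
    )
    return select_stale_proxies(conn, now - 3600)
```

The ? placeholder is the sqlite3 style; other DB-API drivers may expect a different one, such as %s. This does not affect the proxy problem itself, only which rows are selected for checking.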
But when I test it, I see that Scrapy connects to the URL through only 5 proxies and then stops changing them. Sample output (the IPs are scrambled):
2016-02-23 14:54:36 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.130:8080
192.168.100.130
{
"origin": "192.168.100.130"
}
2016-02-23 14:54:36 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.131:8080
192.168.100.131
{
"origin": "192.168.100.131"
}
2016-02-23 14:54:37 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.132:8080
192.168.100.132
{
"origin": "192.168.100.132"
}
# Here Scrapy used the wrong proxy to connect to the site.
2016-02-23 14:54:37 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.134:8080
192.168.100.134
{
"origin": "192.168.100.130"
}
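The mismatch in the last entry above (proxy 192.168.100.134 requested, origin 192.168.100.130 reported) can also be detected in code rather than by reading the log. The following is a minimal sketch, not part of the original spider: the helper name proxy_matches_origin is hypothetical, and splitting origin on commas is an assumption in case the service reports more than one address.

```python
import json

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def proxy_matches_origin(proxy_url, body):
    """Return True if the proxy's host appears in the `origin` field of an httpbin /ip body."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    origin = json.loads(body)["origin"]
    # Assumption: `origin` may hold several comma-separated addresses.
    seen = [part.strip() for part in origin.split(",")]
    return urlparse(proxy_url).hostname in seen
```

Inside check_proxy this could be called as proxy_matches_origin(response.meta['proxy'], response.body), logging a warning whenever it returns False.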
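Maybe I made a mistake somewhere? Any ideas? Thanks.
UPD: Actually, I am now using a middleware to add the proxy to requests. I placed it in the middleware order like this: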
DOWNLOADER_MIDDLEWARES = {
    'checker.middlewares.ProxyCheckMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
But I get the same result. Here is my custom middleware that adds the proxy:
import base64


class ProxyCheckMiddleware(object):

    def process_request(self, request, spider):
        # Only set the proxy if the spider has not already chosen one.
        if 'proxy' not in request.meta:
            request.meta['proxy'] = 'https://{}:8080'.format(request.meta['item']['ip'])
            request.meta['handle_httpstatus_list'] = [302, 503]
            # b64encode, unlike encodestring, does not append a newline.
            user_pass = base64.b64encode('{}:{}'.format(PROXY_USER, PROXY_PASSWORD))
            request.headers['Proxy-Authorization'] = 'Basic {}'.format(user_pass)
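UPD: So far this looks like a bug in Scrapy. See the discussion here: https://github.com/scrapy/scrapy/issues/1807
That does not fit my case. I know how to write a middleware, but this is a bug in Scrapy (see the GitHub link at the end of the question). It has been fixed in Scrapy 1.1.0. – drjackild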