How to make sure BS4 requests are being made through the proxy sockets on the list?

I have a proxy list like this one that I want to use for scraping with Python:

proxies_ls = [ '149.56.89.166:3128', 
      '194.44.176.116:8080', 
      '14.203.99.67:8080', 
      '185.87.65.204:63909', 
      '103.206.161.234:63909', 
      '110.78.177.100:65103']

and I wrote a function, crawlSite(url), that scrapes a URL using BS4 and the requests module. Here is the code:

# Libraries for crawling and regex
from bs4 import BeautifulSoup 
import requests 
from fake_useragent import UserAgent 
import re 

# Library for dates
import datetime 
from time import gmtime, strftime 

# Library for writing logs
import os 
import errno 

# Library for random delay
import time 
import random 

print('BOT iniciado: '+ datetime.datetime.now().strftime('%d-%m-%Y %H:%M:%S')) 

proxies_ls = [ '149.56.89.166:3128', 
      '194.44.176.116:8080', 
      '14.203.99.67:8080', 
      '185.87.65.204:63909', 
      '103.206.161.234:63909', 
      '110.78.177.100:65103'] 

def crawlSite(url):
    # Chrome emulation
    ua=UserAgent()
    header={'user-agent':ua.chrome}
    random.shuffle(proxies_ls)

    # Random delay
    print('antes do delay: '+ datetime.datetime.now().strftime('%d-%m-%Y %H:%M:%S'))
    tempoRandom=random.randint(1,5)
    time.sleep(tempoRandom)

    try:
        randProxy=random.choice(proxies_ls)
        # Getting the webpage, creating a Response object emulated with chrome with a 30sec timeout.
        response = requests.get(url,proxies = {'https':randProxy},headers=header,timeout=30)
        print(response)
        print('Resposta obtida: '+ datetime.datetime.now().strftime('%d-%m-%Y %H:%M:%S'))

        # Avoid HTTP request errors
        if response.status_code == 404:
            raise ConnectionError("HTTP Response [404] - The requested resource could not be found")
        elif response.status_code == 409:
            raise ConnectionError("HTTP Response [409] - Possible Cloudflare DNS resolution error")
        elif response.status_code == 403:
            raise ConnectionError("HTTP Response [403] - Permission denied error")
        elif response.status_code == 503:
            raise ConnectionError("HTTP Response [503] - Service unavailable error")
        print('RR Status {}'.format(response.status_code))
        # Extracting the source code of the page.
        data = response.text

    except ConnectionError:
        try:
            proxies_ls.remove(randProxy)
        except ValueError:
            pass
        randProxy=random.choice(proxies_ls)

    return BeautifulSoup(data, 'lxml')

What I want is to make sure that connections are made only through the proxies in the list. The random part

randProxy=random.choice(proxies_ls)

works fine, but the part that checks whether the proxy is valid or not does not, mainly because I still get a 200 response with a "made-up proxy".

If I reduce the list to:

proxies_ls = ['149.56.89.166:3128']

with a proxy that does not work, I still get a 200 response! (I checked it with a proxy checker like https://pt.infobyip.com/proxychecker.php and the proxy does not work...)

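So my questions are (I'll enumerate them so it's easier): a) Why am I getting this 200 response instead of a 4xx response? b) How can I force requests to use the proxy?

Thanks,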

Eunito。

Source

2017-09-15 Eunito

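Read the documentation carefully: http://docs.python-requests.org/en/master/user/advanced/#proxies. You need to specify the protocol in the proxies dictionary: requests.get(url, proxies={'https': 'http://%s' % randProxy}, ...). Right now you are only passing an IP address and port. – Kalkran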

Hi @Kalkran, you're right! But even with that correction, using the single proxy mentioned above (proxies_ls = ['149.56.89.166:3128']), I still get a 200... – Eunito

Maybe because you are crawling an HTTP site rather than an HTTPS site? You are only giving it a proxy for HTTPS sites. – Kalkran
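The two points from the comments above (include the scheme in the proxy URL, and register the proxy for both http and https targets) can be sketched as follows. The helper name build_proxies is hypothetical and not part of requests, and the sketch assumes the listed addresses are plain HTTP proxies:

```python
def build_proxies(address):
    """Return a requests-style proxies dict for an 'ip:port' HTTP proxy."""
    # Assumption: the proxy itself speaks plain HTTP, hence the 'http://' prefix.
    proxy_url = 'http://%s' % address
    # requests picks the proxy by the scheme of the target URL, so a dict
    # with only an 'https' key does not cover http:// targets.
    return {'http': proxy_url, 'https': proxy_url}


proxies = build_proxies('149.56.89.166:3128')
```

The resulting dict would then be passed as requests.get(url, proxies=proxies, headers=header, timeout=30).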

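So basically, if I understand your question correctly, you just want to check whether the proxy is valid. requests raises exceptions you can catch for this, so you can do something like: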

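import requests
from requests.exceptions import ProxyError

proxy_url = 'http://%s' % randProxy
try:
    response = requests.get(url, proxies={'http': proxy_url, 'https': proxy_url}, headers=header, timeout=30)
except ProxyError:
    # The connection to the proxy failed, so treat the proxy as invalid.
    print('Invalid proxy: ' + randProxy)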
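Building on the same idea, here is a minimal sketch of retiring a proxy that fails and retrying with another one, which is what the except branch in the question appears to be aiming for. The function name fetch_with_rotation and its parameters are hypothetical; fetch is assumed to be any callable that accepts the same keyword arguments as requests.get:

```python
import random


def fetch_with_rotation(url, proxy_pool, fetch, max_attempts=3, timeout=30):
    """Try up to max_attempts proxies, removing each one that fails.

    Returns a (response, proxy_address) tuple and mutates proxy_pool in place.
    """
    for _ in range(max_attempts):
        if not proxy_pool:
            break
        address = random.choice(proxy_pool)
        proxy_url = 'http://%s' % address
        try:
            response = fetch(url,
                             proxies={'http': proxy_url, 'https': proxy_url},
                             timeout=timeout)
        except OSError:
            # requests' ProxyError, ConnectionError and Timeout derive from
            # RequestException, which is itself an IOError/OSError subclass.
            proxy_pool.remove(address)
            continue
        return response, address
    raise RuntimeError('no working proxy left for %s' % url)
```

With requests, the call would look like fetch_with_rotation(url, proxies_ls, requests.get). Returning the address alongside the response makes it possible to log which proxy actually served each page.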

Source

2017-09-15 12:31:41 chad
