2017-01-09 36 views

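Requests with proxies and threads

I am trying to check a huge list of proxies using requests. To do that, I use threads. I really need these threads because I will reuse the same code structure afterwards to hit a website, where 1 thread = 1 IP making many requests.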

So my code is:

import re
import sys
from threading import Thread

import requests


def proxyList(proxies, nbThread):
    # Empty the output file before the threads start appending to it.
    with open('proxyList.txt', 'w') as f:
        f.write('')
    proxies = list(set(proxies))
    prox = []
    lenS = len(proxies)
    pas = int(lenS/nbThread)
    # One slice of the list per thread; the remainder goes to the last slice.
    subSeq = [proxies[i*pas:(i+1)*pas] for i in range(nbThread)]
    subSeq[nbThread-1] += proxies[nbThread*pas:]
    threads = [0 for i in range(nbThread)]

    for i in range(nbThread):
        threads[i] = proxy(subSeq[i],)
    for i in range(nbThread):
        threads[i].start()
    for i in range(nbThread):
        threads[i].join()

    return list(set(prox))


class proxy(Thread):
    def __init__(self, proxies):
        Thread.__init__(self)
        self.proxies = proxies

    def run(self):
        k = 0
        prox = []
        for proxy in self.proxies:
            k += 1
            try:
                requests.get("https://api.ipify.org/?format=json", timeout=15, proxies={"https": str(re.findall(r'[0-9]+(?:\.[0-9]+){3}:[0-9]+', proxy)[0])})
                try:
                    requests.get("https://api.ipify.org/?format=json", timeout=15, proxies={"https": str(re.findall(r'[0-9]+(?:\.[0-9]+){3}:[0-9]+', proxy)[0])})
                    prox += [str(proxy)]
                    print("Bon proxy : " + str(re.findall(r'[0-9]+(?:\.[0-9]+){3}:[0-9]+', proxy)[0]))
                    with open('proxyList.txt', 'a') as f:
                        f.writelines(str(re.findall(r'[0-9]+(?:\.[0-9]+){3}:[0-9]+', proxy)[0]) + '\n')
                except:
                    t = "a"
            except:
                print("Mauvais proxy : " + str(re.findall(r'[0-9]+(?:\.[0-9]+){3}:[0-9]+', proxy)[0]))
                print(sys.exc_info()[0])
        print("Terminé: " + str(k), prox)

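It works, but I don't always get the same output, and the result depends heavily on the number of threads I set.

Do you have any idea why? I have read that requests may not be the best choice here, but I really need my threads for my proxies.

Thanks, Djokx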
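One thing that certainly changes with nbThread is how the list gets cut up. The sketch below isolates the slicing arithmetic from proxyList so it can be run without any network access. The function names, the round-robin alternative and the sample data are made up for illustration, and nothing here shows that the split is what causes the differing results.

```python
def split_like_proxylist(proxies, nb_thread):
    """Reproduce the slicing done in proxyList (same arithmetic, no threads)."""
    pas = int(len(proxies) / nb_thread)  # chunk size, rounded down
    sub_seq = [proxies[i * pas:(i + 1) * pas] for i in range(nb_thread)]
    sub_seq[nb_thread - 1] += proxies[nb_thread * pas:]  # remainder goes to the last chunk
    return sub_seq


def split_evenly(proxies, nb_thread):
    """Hypothetical alternative: deal items round-robin so chunk sizes differ by at most 1."""
    return [proxies[i::nb_thread] for i in range(nb_thread)]


sample = ["p%d" % n for n in range(10)]  # placeholder names, not real proxies

print([len(c) for c in split_like_proxylist(sample, 4)])   # the last chunk is larger
print([len(c) for c in split_like_proxylist(sample, 20)])  # more threads than items: pas is 0
print([len(c) for c in split_evenly(sample, 4)])
```

With the original slicing, asking for more threads than there are proxies makes pas equal to 0, so every slice but the last is empty and a single thread ends up checking the whole list.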

Answers


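I'm sure requests is the best way. Here is a discussion about it:

https://gist.github.com/kennethreitz/973705

I also tried to make some improvements to your code. It still does the same job, and it avoids calling the get method twice inside the loop.

Hope it helps.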

import re
import sys
from threading import Thread

import requests


def proxyList(proxies, nbThread):
    # Empty the output file before the threads start appending to it.
    with open('proxyList.txt', 'w') as f:
        f.write('')
    proxies = list(set(proxies))
    prox = []
    lenS = len(proxies)
    pas = int(lenS/nbThread)
    subSeq = [proxies[i*pas:(i+1)*pas] for i in range(nbThread)]
    subSeq[nbThread-1] += proxies[nbThread*pas:]
    threads = [0 for i in range(nbThread)]

    # Each thread is created, started and joined inside the same loop.
    for i in range(nbThread):
        threads[i] = proxy(subSeq[i],)
        threads[i].start()
        threads[i].join()

    return list(set(prox))


class proxy(Thread):
    def __init__(self, proxies):
        Thread.__init__(self)
        self.proxies = proxies

    def run(self):
        k = 0
        prox = []
        for proxy in self.proxies:
            k += 1
            try:
                s = requests.Session()
                try:
                    # A single request per proxy instead of two.
                    s.get("https://api.ipify.org/?format=json", timeout=15, proxies={"https": str(re.findall(r'[0-9]+(?:\.[0-9]+){3}:[0-9]+', proxy)[0])})
                    prox += [str(proxy)]
                    print("Bon proxy : " + str(re.findall(r'[0-9]+(?:\.[0-9]+){3}:[0-9]+', proxy)[0]))
                    with open('proxyList.txt', 'a') as f:
                        f.writelines(str(re.findall(r'[0-9]+(?:\.[0-9]+){3}:[0-9]+', proxy)[0]) + '\n')
                except:
                    t = "a"
            except:
                print("Mauvais proxy : " + str(re.findall(r'[0-9]+(?:\.[0-9]+){3}:[0-9]+', proxy)[0]))
                print(sys.exc_info()[0])
        print("Terminé: " + str(k), prox)
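
In both versions of the code, prox inside run() is a local variable of each thread, so the separate prox list that proxyList returns is always empty, and the regular expression is evaluated again at every use. The following is a minimal sketch of one way to collect the good addresses in a single shared list guarded by a lock, not a drop-in replacement. The names extract_address, check_all and is_alive are hypothetical; is_alive stands in for the requests call so that the sketch runs offline.

```python
import re
import threading

# Same pattern as in the code above, compiled once.
ADDRESS = re.compile(r'[0-9]+(?:\.[0-9]+){3}:[0-9]+')


def extract_address(line):
    """Return the first ip:port found in line, or None if there is none."""
    match = ADDRESS.search(line)
    return match.group(0) if match else None


def check_all(lines, nb_thread, is_alive):
    """Call is_alive(address) from nb_thread threads and return the good addresses.

    is_alive is a hypothetical callable supplied by the caller; in the code
    above its role is played by the requests call that goes through the proxy.
    """
    good = []
    lock = threading.Lock()  # protects the shared list

    def worker(chunk):
        for line in chunk:
            address = extract_address(line)
            if address is None:
                continue
            try:
                ok = is_alive(address)
            except Exception:
                ok = False  # any failure counts as a bad proxy
            if ok:
                with lock:
                    good.append(address)

    chunks = [lines[i::nb_thread] for i in range(nb_thread)]
    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for t in threads:
        t.start()  # start every thread first
    for t in threads:
        t.join()   # then wait for all of them
    return sorted(set(good))


# Stand-in check: pretend that only proxies on port 8080 answer.
fake_lines = ["1.2.3.4:8080 FR", "5.6.7.8:3128", "no address here", "9.9.9.9:8080"]
print(check_all(fake_lines, 3, lambda address: address.endswith(":8080")))
```

Passing the check in as a function keeps the threading logic separate from the network call, which also makes it possible to exercise the collection step without any proxy.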

Thanks for your answer. But when I run proxyList twice with the same arguments, the results of the two runs are very different. Do you know why? Thanks –


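By the way, thanks for the code improvement, mate! :) But it seems that if I do: for i in range(nbThread): threads[i] = proxy(subSeq[i]); threads[i].start(); threads[i].join() it will wait for the first thread to stop before starting the second, no? I tried it, and it is much slower! –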
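
The effect described in this comment can be reproduced without any proxy. In the rough sketch below, fake_check and the two run_* helpers are invented for the demonstration, and time.sleep stands in for the network wait.

```python
import threading
import time


def fake_check(delay):
    """Stand-in for one thread's work: sleeps instead of calling a proxy."""
    time.sleep(delay)


def run_joined_in_loop(nb_thread, delay):
    """start() and join() in the same loop: each thread finishes before the next starts."""
    begin = time.perf_counter()
    for _ in range(nb_thread):
        t = threading.Thread(target=fake_check, args=(delay,))
        t.start()
        t.join()
    return time.perf_counter() - begin


def run_joined_after(nb_thread, delay):
    """Start every thread, then join them all: the waits overlap."""
    begin = time.perf_counter()
    threads = [threading.Thread(target=fake_check, args=(delay,)) for _ in range(nb_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - begin


sequential = run_joined_in_loop(4, 0.2)
concurrent = run_joined_after(4, 0.2)
print("join inside the loop:  %.2fs" % sequential)
print("join after all starts: %.2fs" % concurrent)
```

The first timing should come out close to the sum of the four delays and the second close to a single delay, since join() blocks the calling loop until that thread has finished.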


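I assumed you did not need to wait for every loop to finish, so I suggested putting start and join in a single loop, thinking you would wait less. But does that mean there is no multithreading anymore? If so, you are right: it will take longer. Finally, can you clarify which results differ between runs? – nuriselcuk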
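
To pin down which results differ, the lists from two runs can be compared as sets. A small sketch follows; compare_runs is a made-up helper and the addresses are placeholders, not real output from either run.

```python
def compare_runs(first, second):
    """Split two result lists into: only in first, only in second, in both."""
    a, b = set(first), set(second)
    return sorted(a - b), sorted(b - a), sorted(a & b)


# Placeholder addresses, not real results.
run_1 = ["1.2.3.4:8080", "5.6.7.8:3128", "9.9.9.9:80"]
run_2 = ["5.6.7.8:3128", "9.9.9.9:80", "10.0.0.1:8888"]

only_first, only_second, both = compare_runs(run_1, run_2)
print("only in run 1:", only_first)
print("only in run 2:", only_second)
print("in both runs: ", both)
```

In practice the two lists could come from the return value of proxyList or from the lines of proxyList.txt saved after each run.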