2013-08-16 81 views

Python: How do I access a list of URLs through a list of proxies?

This is the code I have so far, tentatively:

import json
import urllib2

with open('proxies.txt') as proxies:
    for line in proxies:
        proxy = json.loads(line)
        proxy_handler = urllib2.ProxyHandler(proxy)
        opener = urllib2.build_opener(proxy_handler)
        urllib2.install_opener(opener)
with open('urls.txt') as urls:
    for line in urls:
        url = line.rstrip()
        data = urllib2.urlopen(url).read()
        print data

My proxies.txt file looks like this:

{"https": "https://94.142.27.4:3128"} 
{"http": "http://118.97.95.174:8080"} 
{"http": "http://66.62.236.15:8080"} 

And my urls.txt file looks like this:

http://www.google.com 
http://www.facebook.com 
http://www.reddit.com 

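It seems to install all of the proxies first and then process each URL in the list with all of the proxies installed. What I actually want is to access every URL through each proxy individually. So: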

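  1. Access url1 through proxy 1
  2. Access url2 through proxy 1
  3. Access url3 through proxy 1
  4. Access url1 through proxy 2
  5. Access url2 through proxy 2
  6. Access url3 through proxy 2
  7. Access url1 through proxy 3
  8. Access url2 through proxy 3
  9. Access url3 through proxy 3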

Is there a way to do this? Is it already doing this? Am I misunderstanding what a proxy really is? Am I misunderstanding what install_opener actually does?

Answer


I'm not sure this is exactly what you want, but...

Since you want to try every URL through every proxy, you can use itertools.product to easily build all the combinations:

import itertools

with open('proxies.txt') as proxies:
    with open('urls.txt') as urls:
        for (proxie, url) in itertools.product(proxies, urls):
            print "access", url.rstrip(), "using", proxie.rstrip()

Of course, you will have to replace the print with your actual code.
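
To make the resulting order concrete, here is a small sketch written for Python 3 (where print is a function). It runs itertools.product over in-memory lists instead of files. The helper name pair_up and the sample values are illustrative assumptions, not part of the original code.

```python
import itertools
import json


def pair_up(proxy_lines, url_lines):
    """Return (proxy_dict, url) pairs, with the proxy varying slowest."""
    # Skip blank lines, parse each proxy line as JSON, strip newlines from URLs.
    proxies = [json.loads(line) for line in proxy_lines if line.strip()]
    urls = [line.strip() for line in url_lines if line.strip()]
    return list(itertools.product(proxies, urls))


if __name__ == "__main__":
    sample_proxies = ['{"http": "http://118.97.95.174:8080"}\n',
                      '{"http": "http://66.62.236.15:8080"}\n']
    sample_urls = ["http://www.google.com\n", "http://www.facebook.com\n"]
    for proxy, url in pair_up(sample_proxies, sample_urls):
        print("access", url, "using", proxy)
```

Because the proxies are the first argument to product, every URL is paired with the first proxy before the second proxy appears, which matches the order listed in the question.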


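That said, the only real problem with your original code is probably the indentation. install_opener sets one global default opener, so running the whole proxy loop first leaves only the last proxy in effect by the time the URL loop starts. You want nested loops. So you should write: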

import json
import urllib2

with open('proxies.txt') as proxies:
    for line in proxies:
        # Install an opener for this proxy; it replaces the previous one.
        proxy = json.loads(line)
        proxy_handler = urllib2.ProxyHandler(proxy)
        opener = urllib2.build_opener(proxy_handler)
        urllib2.install_opener(opener)

        # Fetch every URL while this proxy's opener is the active one.
        with open('urls.txt') as urls:
            for line in urls:
                url = line.rstrip()
                data = urllib2.urlopen(url).read()
                print data
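
As a possible variation, the opener returned by build_opener can be called directly, so nothing has to be installed globally on each pass. The following is only a sketch, written for Python 3, where urllib2's functionality lives in urllib.request. The function names make_opener and fetch_through_proxies and the 10-second timeout are assumptions for illustration, not part of the original code.

```python
import json
import urllib.request


def make_opener(proxy):
    """Build an opener that routes requests through the given proxy mapping."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxy))


def fetch_through_proxies(proxy_lines, url_lines, timeout=10):
    """Yield (proxy, url, body_or_error) for every proxy/URL combination."""
    urls = [line.strip() for line in url_lines if line.strip()]
    for line in proxy_lines:
        if not line.strip():
            continue
        proxy = json.loads(line)
        opener = make_opener(proxy)  # local opener, nothing installed globally
        for url in urls:
            try:
                with opener.open(url, timeout=timeout) as response:
                    yield proxy, url, response.read()
            except OSError as exc:  # URLError and socket timeouts derive from OSError
                # Public proxies fail often; report the error and keep going.
                yield proxy, url, exc


if __name__ == "__main__":
    with open("proxies.txt") as proxies, open("urls.txt") as urls:
        for proxy, url, result in fetch_through_proxies(proxies, urls):
            print(proxy, url, result)
```

Catching the error per request means one dead proxy does not stop the remaining combinations from being tried.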