Python hangs when looping through a long list with urllib.request

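I have written some code that loops through a list of URLs, opens each one with urllib.request, and then parses it with BeautifulSoup. The only problem is that the list is long (about 5,000 URLs), and the code runs successfully for about 200 URLs before hanging indefinitely. Is there a way to either a) skip to the next URL after a specific time, e.g. 30 seconds, or b) retry opening the URL for a set period of time and then move on to the next item?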

import csv
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

with open('csv_file.csv', 'r') as f:
    reader = csv.reader(f)
    urls_list = list(reader)
    for j in range(0, len(urls_list)):
        # each CSV row is a list of fields; join them back into one URL string
        url = ''.join(urls_list[j])
        id = url[-10:].replace(".html", "")

        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        s = urlopen(req).read()
        soup = BeautifulSoup(s, "lxml")

Any suggestions are much appreciated!


2016-10-29 user3725021

The 'urlopen' method has a timeout parameter. Use it.

Thanks, could you give a simple example? I'm not sure how to use it to do a) or b) as described above. – user3725021

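The documentation (Python 2) says:

The urllib2 module defines the following functions: urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]]) Open the URL url, which can be either a string or a Request object.

Adapted to your code, it looks like this: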

import socket
from urllib.error import URLError  # HTTPError is a subclass of URLError

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    s = urlopen(req, timeout=10).read()  # give up after 10 seconds
except (URLError, socket.timeout) as e:
    print(str(e))  # print error detail (this may not be a timeout after all!)
    continue  # skip to next element
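
The snippet above addresses option a). For option b), the same call can be wrapped in a retry loop. What follows is only a minimal sketch under stated assumptions: `fetch_with_retries`, its `retries` and `delay` parameters, and the injectable `opener` argument are hypothetical names introduced here for illustration and are not part of urllib. Only `Request`, `urlopen`, `URLError`, and `socket.timeout` come from the standard library.

```python
import socket
import time
from urllib.error import URLError
from urllib.request import Request, urlopen


def fetch_with_retries(url, retries=3, timeout=10, delay=1.0, opener=urlopen):
    """Return the response body as bytes, or None if every attempt fails.

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    function can be exercised without network access (an assumption of
    this sketch, not something the original code had).
    """
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    for attempt in range(1, retries + 1):
        try:
            # timeout bounds the connection attempt and each blocking read
            with opener(req, timeout=timeout) as resp:
                return resp.read()
        except (URLError, socket.timeout) as e:
            print(f"attempt {attempt}/{retries} failed for {url}: {e}")
            if attempt < retries:
                time.sleep(delay)  # brief pause before trying again
    return None
```

Inside the loop over `urls_list`, this could be used as `s = fetch_with_retries(url)`, followed by `if s is None: continue` to move on to the next URL once all attempts are exhausted.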


2016-10-29 07:38:33
