2012-05-10 43 views
1

Python urllib: skip URL on HTTP or URL error

How can I modify my script to skip a URL if the connection times out or the URL is invalid/404?

Python

#!/usr/bin/python 

#parser.py: Downloads Bibles and parses all data within <article> tags. 

__author__  = "Cody Bouche" 
__copyright__ = "Copyright 2012 Digital Bible Society" 

from BeautifulSoup import BeautifulSoup 
import lxml.html as html 
import urlparse 
import os, sys 
import urllib2 
import re 

print ("downloading and parsing Bibles...") 
root = html.parse(open('links.html')) 
for link in root.findall('//a'): 
    url = link.get('href') 
    name = urlparse.urlparse(url).path.split('/')[-1] 
    dirname = urlparse.urlparse(url).path.split('.')[-1] 
    f = urllib2.urlopen(url) 
    s = f.read() 
    if not os.path.isdir(dirname):
        os.mkdir(dirname)
    soup = BeautifulSoup(s) 
    articleTag = soup.html.body.article 
    converted = str(articleTag) 
    full_path = os.path.join(dirname, name) 
    open(full_path, 'wb').write(converted) 
    print(name) 
print("DOWNLOADS COMPLETE!") 

Answers

2

To apply a timeout to the request, add the timeout argument to the call to urlopen. From the docs:

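The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.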
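The "global default timeout" in the quote is the one exposed by the socket module. A minimal sketch of setting it, assuming you want one value for the whole script; the helper name set_default_timeout and the 10-second value are made up for illustration:

```python
import socket

def set_default_timeout(seconds):
    """Set the process-wide default socket timeout; return the previous value.

    Hypothetical helper, not part of urllib2.
    """
    previous = socket.getdefaulttimeout()  # None means "block forever"
    socket.setdefaulttimeout(seconds)
    return previous

previous = set_default_timeout(10)  # 10 seconds is an arbitrary choice
# From here on, urlopen(url) without an explicit timeout uses 10 seconds.
# A per-call value still takes precedence: urlopen(url, timeout=3)
```

Passing timeout to each urlopen call is usually the less surprising option, since the global default affects every socket the process creates.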

See the section of this guide on how to handle exceptions with urllib2. In fact, I found the whole guide very useful.

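The HTTP status code for a request timeout is 408. Note that 408 is sent by the server; a timeout triggered on the client side by the timeout argument normally surfaces as a URLError with no code attribute (its reason is a socket.timeout), so that case needs handling too. Putting it together, to handle timeouts you would do something like: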

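from urllib2 import urlopen, URLError

try:
    # timeout must be passed by keyword: the second positional argument is data
    response = urlopen(req, timeout=3)  # 3 seconds
except URLError as e:
    if hasattr(e, 'code'):
        # HTTPError (a subclass of URLError) carries the HTTP status code
        if e.code == 408:
            print('Timeout %s' % e.code)
        elif e.code == 404:
            print('File Not Found %s' % e.code)
        # etc.
    else:
        # no status code: the server was never reached (timeout, DNS failure, ...)
        print('Failed to reach server: %s' % e.reason)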
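To make the distinction between the failure modes explicit, here is a rough sketch of a classifier for exceptions coming out of urlopen. The function name describe_error and its return labels are invented for this example. The import fallback is only there so the snippet also runs on Python 3, where urllib2's exceptions live in urllib.error.

```python
import socket

try:
    from urllib2 import HTTPError, URLError       # Python 2
except ImportError:
    from urllib.error import HTTPError, URLError  # Python 3

def describe_error(exc):
    """Return a short label for an exception raised while fetching a URL.

    Hypothetical helper for illustration only.
    """
    # Check HTTPError first: it is a subclass of URLError.
    if isinstance(exc, HTTPError):
        return 'http %d' % exc.code
    if isinstance(exc, URLError):
        # A connect timeout is wrapped in URLError with the cause in .reason.
        if isinstance(exc.reason, socket.timeout):
            return 'timeout'
        return 'unreachable'
    # A timeout while reading the body is raised directly as socket.timeout.
    if isinstance(exc, socket.timeout):
        return 'timeout'
    return 'other'
```

The order of the isinstance checks matters: testing URLError before HTTPError would swallow 404s and other status errors into the generic branch.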
1

Try putting your urlopen line inside a try/except statement. Take a look at this:

docs.python.org/tutorial/errors.html, section 8.3

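Look at the different exceptions, and when you hit one, just move on to the next iteration of the loop with the continue statement.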
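The try/except plus continue pattern could look roughly like the sketch below. The function fetch_all, its opener parameter, and the 10-second default are assumptions made for this example and are not taken from the question's script. The import fallback lets the same code run on Python 3.

```python
import socket

try:
    from urllib2 import urlopen, URLError   # Python 2
except ImportError:
    from urllib.request import urlopen      # Python 3
    from urllib.error import URLError

def fetch_all(urls, timeout=10, opener=urlopen):
    """Return {url: body} for every URL that could be fetched; skip the rest.

    Hypothetical helper; opener is injectable so the loop can be tested offline.
    """
    pages = {}
    for url in urls:
        try:
            response = opener(url, timeout=timeout)
            body = response.read()
        except (URLError, socket.timeout, ValueError) as e:
            # URLError covers HTTPError (404 etc.) and connection failures,
            # socket.timeout covers a stalled read, and urlopen raises
            # ValueError for a URL it cannot parse (e.g. no scheme).
            print('skipping %s: %s' % (url, e))
            continue  # go on to the next link
        pages[url] = body
    return pages
```

In the question's script, the parsing and file-writing steps would take the place of `pages[url] = body`, so they only run for links that downloaded successfully.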