How do I catch all possible errors when fetching a URL (Python)?

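In my application, the user enters a URL, and I try to open the link and get the page title. But I realized there can be many different kinds of errors, including Unicode characters or newlines in the title, as well as AttributeError and IOError. At first I tried to catch each error individually, but now, if any URL fetch error occurs, I want to redirect to an error page where the user enters the title manually. How do I catch all possible errors? This is the code I have now: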

title = "title"

try:
    soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
    title = str(soup.html.head.title.string)

    if title == "404 Not Found":
        self.redirect("/urlparseerror")
    elif title == "403 - Forbidden":
        self.redirect("/urlparseerror")
    else:
        title = str(soup.html.head.title.string).lstrip("\r\n").rstrip("\r\n")

except UnicodeDecodeError:
    self.redirect("/urlparseerror?error=UnicodeDecodeError")

except AttributeError:
    self.redirect("/urlparseerror?error=AttributeError")

# https url:
except IOError:
    self.redirect("/urlparseerror?error=IOError")

# I tried this else clause to catch any other error
# but it does not work
# this is executed when none of the errors above is true:
#
# else:
#     self.redirect("/urlparseerror?error=some-unknown-error-caught-by-else")

UPDATE

As @Wooble suggested in the comments, I put a try...except around the code that writes title to the database:

try:
    new_item = Main(
        ....
        title = unicode(title, "utf-8"))

    new_item.put()

except UnicodeDecodeError:
    self.redirect("/urlparseerror?error=UnicodeDecodeError")

This works. However, according to the logging output, the out-of-range character â€” is still in title:

***title: 7.2. re â€” Regular expression operations &mdash; Python v2.7.1 documentation**

Do you know why?

Source: 2011-03-05, Zeynel

A UnicodeDecodeError is almost certainly caused by your code not handling Unicode correctly, not by the user entering invalid data. You should fix your application to handle Unicode. – 2011-03-07 23:52:47
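To make the comment above concrete, here is a minimal sketch in Python 3 syntax (the thread itself uses Python 2, so this is an illustration rather than a drop-in fix). The byte string is a hypothetical page title containing a UTF-8 encoded em dash; it is not copied from the asker's logs. It shows why an ASCII decode fails on byte 0xe2, why decoding with the right codec works, and how decoding the same bytes with a single-byte codec such as cp1252 produces three-character debris instead of one dash.

```python
# Hypothetical raw title bytes: an em dash (U+2014) encoded as UTF-8 is e2 80 94.
raw_title = b"7.2. re \xe2\x80\x94 Regular expression operations"

# Decoding as ASCII fails on the first non-ASCII byte (0xe2).
try:
    raw_title.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the encoding the bytes were actually written in succeeds.
title = raw_title.decode("utf-8")

# Decoding UTF-8 bytes with a single-byte codec does not raise,
# but each of the three bytes becomes its own (wrong) character.
garbled = raw_title.decode("cp1252")

print(ascii_error.start)   # index of the offending byte
print(len(title), len(garbled))
```

If the stored or logged title shows three odd characters where a dash should be, a mismatch of this kind between the encoding used to write the bytes and the one used to read them is a likely place to look.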

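You can use except without specifying any type to catch all exceptions.

From the Python documentation, http://docs.python.org/tutorial/errors.html (the bare except at the end handles any exception that is not an IOError or a ValueError):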

import sys 

try: 
    f = open('myfile.txt') 
    s = f.readline() 
    i = int(s.strip()) 
except IOError as (errno, strerror): 
    print "I/O error({0}): {1}".format(errno, strerror) 
except ValueError: 
    print "Could not convert data to an integer." 
except: 
    print "Unexpected error:", sys.exc_info()[0] 
    raise

The last except clause catches any exception that has not already been caught.
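The same pattern can be sketched in Python 3 syntax. `read_int` and its return convention are hypothetical names chosen for illustration. The final clause uses `except Exception` rather than a bare `except:`, on the assumption that `KeyboardInterrupt` and `SystemExit` should not be intercepted (a bare `except:` catches those too).

```python
def read_int(path):
    """Return (value, error_label); exactly one of the two is None."""
    try:
        with open(path) as f:
            value = int(f.readline().strip())
    except OSError as exc:       # IOError is an alias of OSError in Python 3
        return None, "I/O error: {}".format(exc.strerror)
    except ValueError:
        return None, "Could not convert data to an integer."
    except Exception as exc:     # anything not matched by the clauses above
        return None, "Unexpected error: {}".format(type(exc).__name__)
    return value, None
```

Handlers are tried from top to bottom and only the first match runs, so the most specific types go first and the catch-all goes last.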

Source: 2011-03-05 23:32:00, Hernan

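OK. I changed the code to use a final `except` clause, but even now the `UnicodeDecodeError` is not caught: `UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 12: ordinal not in range(128)` (there is an em dash in the title at this URL: http://docs.python.org/library/string.html). What am I doing wrong? – Zeynel 2011-03-05 23:56:32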

Thanks for the answer. That solved the problem. – Zeynel 2011-03-06 19:45:00

You can use the top-level exception type Exception, which catches any exception that was not caught by an earlier clause.

http://docs.python.org/library/exceptions.html#exception-hierarchy

try:
    soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
    title = str(soup.html.head.title.string)

    if title == "404 Not Found":
        self.redirect("/urlparseerror")
    elif title == "403 - Forbidden":
        self.redirect("/urlparseerror")
    else:
        title = str(soup.html.head.title.string).lstrip("\r\n").rstrip("\r\n")

except UnicodeDecodeError:
    self.redirect("/urlparseerror?error=UnicodeDecodeError")

except AttributeError:
    self.redirect("/urlparseerror?error=AttributeError")

# https url:
except IOError:
    self.redirect("/urlparseerror?error=IOError")

except Exception, ex:
    print "Exception caught: %s" % ex.__class__.__name__

Source: 2011-03-05 23:56:49, ssoler

Thanks. But this did not catch the Unicode error either. I don't know what I'm doing wrong. – Zeynel 2011-03-06 00:04:30

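@Zeynel, you can see in Python's exception hierarchy (http://docs.python.org/library/exceptions.html#exception-hierarchy) that UnicodeDecodeError is a subtype of Exception, so it should be caught. Your error is probably being raised in a different part of your code. – ssoler 2011-03-06 00:31:25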
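Both halves of that point can be checked directly. The sketch below uses Python 3 syntax; `fetch_title`, `save_title`, and `handle_request` are hypothetical stand-ins for the scraping step, the database write, and the request handler, not the asker's actual code. It shows that a `try` block only catches exceptions raised inside it, however broad the `except` clause is.

```python
# UnicodeDecodeError sits below Exception in the built-in hierarchy.
print(issubclass(UnicodeDecodeError, Exception))
print(UnicodeDecodeError.__mro__[1:4])

def fetch_title():
    # Hypothetical scraping step that succeeds and returns raw bytes.
    return b"string \xe2\x80\x94 Common string operations"

def save_title(raw):
    # Hypothetical storage step that assumes ASCII and therefore can fail.
    return raw.decode("ascii")

def handle_request():
    try:
        raw = fetch_title()
    except Exception:
        return "redirect: fetch failed"
    # This call is outside the try block, so its exception is not caught there.
    return save_title(raw)

try:
    handle_request()
    outcome = "no error"
except UnicodeDecodeError:
    outcome = "error escaped the handler"
print(outcome)
```

Moving the `save_title(raw)` call inside the `try` block, or giving it its own `try`, is what makes the broad clause apply to it.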

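@ssoler: Yes, the error occurs when I try to write the title to the database. There is a Unicode error in the title and it won't write. The whole point of trying to catch the URL fetch errors was to avoid dealing with the Python Unicode nightmare. There seems to be no way to catch the Unicode error with `try ... except`. I don't want to deal with Unicode issues, so I gave up... which means users will need to enter the title when they submit a URL. I'm surprised that, at this stage of internet technology, I can't get the title of a page without errors! Well, I don't know what to say... – Zeynel 2011-03-06 00:53:03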
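For readers who reach the same point, one possible shape for a "title or fallback" helper is sketched below in Python 3. It is not the asker's code: it swaps BeautifulSoup for the standard library's `html.parser` and takes already-downloaded bytes so the network step stays separate. The names `extract_title` and `_TitleParser`, the default `utf-8` encoding, and the `errors="replace"` policy are all assumptions made for illustration.

```python
from html.parser import HTMLParser

class _TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)

def extract_title(page_bytes, encoding="utf-8"):
    """Return the page title as text, or None if no usable title is found."""
    # errors="replace" never raises; undecodable bytes become U+FFFD.
    text = page_bytes.decode(encoding, errors="replace")
    parser = _TitleParser()
    parser.feed(text)
    parser.close()
    # Collapse newlines and runs of whitespace into single spaces.
    title = " ".join("".join(parser.parts).split())
    return title or None
```

A caller could treat `None` as the signal to redirect to the manual-entry page, which keeps the fallback decision in one place instead of spreading it across several `except` clauses.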
