2016-02-25 68 views
9

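SQL Server transient exception numbers

I want to write some wrapper code around my database calls (using C# and Microsoft technologies to access the database) that automatically retries on a 'transient' exception. By transient, I mean something that has a good chance of eventually resolving itself (as opposed to a logic error that will never work). Examples I can think of include:

  • Deadlocks
  • Connection timeouts
  • Command timeouts

I was planning to detect these using the error number on the SqlException. So, for example: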

List<RunStoredProcedureResultType> resultSet = null;
int limit = 3;
for (int i = 0; i < limit; ++i)
{
    bool isLast = i == limit - 1;
    try
    {
        using (var db = /* ... */)
        {
            resultSet = db.RunStoredProcedure(param1, param2).ToList();
        }
        //if it gets here it was successful
        break;
    }
    catch (SqlException ex)
    {
        if (isLast)
        {
            //3 transient errors in a row. So just kill it
            throw;
        }
        switch (ex.Number)
        {
            case 1205: //deadlock
            case -2: //timeout (command timeout?)
            case 11: //timeout (connection timeout?)
                // do nothing - continue the loop
                break;
            default:
                //a non-transient error. Just throw the exception on
                throw;
        }
    }
    Thread.Sleep(TimeSpan.FromSeconds(1)); //some kind of delay - might not use Sleep
}
return resultSet;

(Forgive any mistakes - I just wrote this on the fly. I also know I could wrap it up nicely...)
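
One way the loop above might be wrapped up is sketched below. This is only an illustration, not the asker's actual code: the `Retry` class, the `Execute` method and its parameters are hypothetical names, and the decision about which exceptions count as transient is passed in as a predicate so the helper itself does not depend on any particular list of error numbers.

```csharp
using System;
using System.Threading;

// Hypothetical helper: names and signature are assumptions for illustration.
public static class Retry
{
    // Runs action up to maxAttempts times. A failure is retried only when
    // attempts remain and isTransient(ex) returns true; otherwise the
    // original exception propagates unchanged.
    public static T Execute<T>(Func<T> action, Func<Exception, bool> isTransient,
                               int maxAttempts, TimeSpan delay)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (isTransient == null) throw new ArgumentNullException(nameof(isTransient));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return action();
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // Transient failure with attempts remaining: wait, then try again.
                Thread.Sleep(delay);
            }
        }
    }
}
```

With a helper like this, the call site would shrink to the database call plus a predicate such as `ex => ex is SqlException sql && (sql.Number == 1205 || sql.Number == -2)`. The exception filter (`when`) needs C# 6 or later, and the `is` pattern in the predicate needs C# 7.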

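So the key question is: which numbers should I consider 'transient'? (I realize that what I consider transient may differ from what others do.) I found a good list here:

https://msdn.microsoft.com/en-us/library/cc645603.aspx

but it's huge and not very helpful. Has anyone else built up a list that they use for something similar?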

UPDATE

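In the end, we went with a 'bad list': we check whether the error is one of a known list of 'non-transient' errors, which are usually programmer mistakes. I've posted the set of numbers we use as an answer.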

+1

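We did something similar and called them 'recoverable exceptions'. They include connection errors, timeouts and deadlocks. But: a deadlock may persist when you repeat the call three times, so consider adding a variable delay or some other deadlock workaround. And connection timeouts caused by overload may get worse if you immediately fire two retries. – dlatikay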
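
A variable delay of the kind suggested in this comment could be sketched as exponential backoff with full jitter: the upper bound doubles on each attempt, and the actual wait is a random value below that bound, so two callers that deadlocked each other are unlikely to retry at the same instant. The `Backoff` class and its parameters are hypothetical names, and the values in the usage note are arbitrary.

```csharp
using System;

// Hypothetical helper: exponential backoff with full jitter.
public static class Backoff
{
    private static readonly Random Rng = new Random();

    // attempt is 1-based. Returns a random delay in
    // [0, min(maxDelay, baseDelay * 2^(attempt - 1))).
    public static TimeSpan NextDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));

        // Cap the exponent so the multiplication cannot overflow to infinity.
        int exponent = Math.Min(attempt - 1, 30);
        double ceilingMs = Math.Min(maxDelay.TotalMilliseconds,
                                    baseDelay.TotalMilliseconds * Math.Pow(2, exponent));

        double jitteredMs;
        lock (Rng) // System.Random is not thread-safe
        {
            jitteredMs = Rng.NextDouble() * ceilingMs;
        }
        return TimeSpan.FromMilliseconds(jitteredMs);
    }
}
```

For example, `Backoff.NextDelay(3, TimeSpan.FromMilliseconds(100), TimeSpan.FromSeconds(10))` yields a wait somewhere below 400 ms. Whether jitter alone is enough to break a recurring deadlock depends on the workload.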

+0

Oh yes, I was planning on a delay. Thanks @dlatikay - will update the above – thab

+0

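Hi, you're asking for a recommendation. Unfortunately this isn't a problem that can be solved, since any programmer may have a different view of your situation, so as currently written this is off-topic. Regards. – jclozano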

Answers

2

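Sorry for answering my own question, but in case anyone is still interested, we just started building up our own list of error codes. Not ideal, but we figured this shouldn't happen very often.

We went with a 'bad list' approach rather than the 'good list' implied in the question. The IDs we have so far are: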

PARAMETER_NOT_SUPPLIED = 201; 
CANNOT_INSERT_NULL_INTO_NON_NULL = 515; 
FOREIGN_KEY_VIOLATION = 547; 
PRIMARY_KEY_VIOLATION = 2627; 
MEMORY_ALLOCATION_FAILED = 4846; 
ERROR_CONVERTING_NUMERIC_TO_DECIMAL = 8114; 
TOO_MANY_ARGUMENTS = 8144; 
ARGUMENT_IS_NOT_A_PARAMETER = 8145; 
ARGS_SUPPLIED_FOR_PROCEDURE_WITHOUT_PARAMETERS = 8146; 
STRING_OR_BINARY_TRUNCATED = 8152; 
INVALID_POINTER = 10006; 
WRONG_NUMBER_OF_PARAMETERS = 18751; 
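
As a rough sketch of how such a 'bad list' might be consulted, the numbers above can go into a set, with anything not in the set treated as retryable. The `SqlErrorClassifier` class and `ShouldRetry` method are hypothetical names; only the error numbers come from the list above.

```csharp
using System.Collections.Generic;

// Hypothetical classifier built from the 'bad list' above.
public static class SqlErrorClassifier
{
    // Error numbers considered non-transient (usually programmer errors).
    private static readonly HashSet<int> NonTransient = new HashSet<int>
    {
        201, 515, 547, 2627, 4846, 8114, 8144, 8145, 8146, 8152, 10006, 18751
    };

    // Anything not on the bad list is treated as worth retrying.
    public static bool ShouldRetry(int errorNumber)
    {
        return !NonTransient.Contains(errorNumber);
    }
}
```

A catch block could then read `catch (SqlException ex) when (SqlErrorClassifier.ShouldRetry(ex.Number))`. The trade-off of a bad list is that unknown errors are retried by default, so a retry limit remains important.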

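Another thing we noticed is that if the connection pool times out, you don't get a SqlException; instead you get an InvalidOperationException reporting 'Timeout expired'. It's a shame it isn't a SqlException, but it's well worth knowing about.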
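
Since there is no error number to switch on in that case, one possible check is to look at the exception type and message. This is a fragile sketch: the `PoolTimeout` class is a hypothetical name, and matching on the English text 'Timeout expired' is an assumption based on the message reported above, so it would not work with localized messages.

```csharp
using System;

// Hypothetical check for the connection pool timeout described above.
public static class PoolTimeout
{
    // Assumption: the message starts with "Timeout expired" (English only).
    public static bool LooksLikePoolTimeout(Exception ex)
    {
        return ex is InvalidOperationException
            && ex.Message != null
            && ex.Message.StartsWith("Timeout expired", StringComparison.OrdinalIgnoreCase);
    }
}
```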

I'll try to keep this up to date with any additions.

1

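There is no canonical list of retryable codes. Other teams have had this problem before. The EF team has built a retry strategy, and you may want to raid their code. But their list is not complete: I've seen commits on GitHub where EF revised it.

I had this problem too. I added some obvious error codes that I dug out of SELECT * FROM sys.messages WHERE language_id = 1033 AND text LIKE '%...%'. After that, I added codes as the application ran into them.

You also need to retry the special error number for timeouts and network errors. The server cannot generate that number because the connection is broken. I think the number is -2, but you need to make sure.

The error severity levels defined by SQL Server are useless for this purpose (mostly, and as a general rule).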

6

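In SQL Azure there is a class [SqlDatabaseTransientErrorDetectionStrategy.cs] for transient fault handling. It covers almost every kind of exception code that can be considered transient. It is also a complete implementation of a Retry strategy.

Adding the snippet for future reference: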

/// <summary> 
/// Error codes reported by the DBNETLIB module. 
/// </summary> 
private enum ProcessNetLibErrorCode 
{ 
    ZeroBytes = -3, 

    Timeout = -2, 
    /* Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. */ 

    Unknown = -1, 

    InsufficientMemory = 1, 

    AccessDenied = 2, 

    ConnectionBusy = 3, 

    ConnectionBroken = 4, 

    ConnectionLimit = 5, 

    ServerNotFound = 6, 

    NetworkNotFound = 7, 

    InsufficientResources = 8, 

    NetworkBusy = 9, 

    NetworkAccessDenied = 10, 

    GeneralError = 11, 

    IncorrectMode = 12, 

    NameNotFound = 13, 

    InvalidConnection = 14, 

    ReadWriteError = 15, 

    TooManyHandles = 16, 

    ServerError = 17, 

    SSLError = 18, 

    EncryptionError = 19, 

    EncryptionNotSupported = 20 
} 

There is also a switch case that checks the error number returned in the SQL exception:

switch (err.Number) 
{ 
    // SQL Error Code: 40501 
    // The service is currently busy. Retry the request after 10 seconds. Code: (reason code to be decoded). 
    case ThrottlingCondition.ThrottlingErrorNumber: 
     // Decode the reason code from the error message to determine the grounds for throttling. 
     var condition = ThrottlingCondition.FromError(err); 

     // Attach the decoded values as additional attributes to the original SQL exception. 
     sqlException.Data[condition.ThrottlingMode.GetType().Name] = 
      condition.ThrottlingMode.ToString(); 
     sqlException.Data[condition.GetType().Name] = condition; 

     return true; 

    // SQL Error Code: 10928 
    // Resource ID: %d. The %s limit for the database is %d and has been reached. 
    case 10928: 
    // SQL Error Code: 10929 
    // Resource ID: %d. The %s minimum guarantee is %d, maximum limit is %d and the current usage for the database is %d. 
    // However, the server is currently too busy to support requests greater than %d for this database. 
    case 10929: 
    // SQL Error Code: 10053 
    // A transport-level error has occurred when receiving results from the server. 
    // An established connection was aborted by the software in your host machine. 
    case 10053: 
    // SQL Error Code: 10054 
    // A transport-level error has occurred when sending the request to the server. 
    // (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.) 
    case 10054: 
    // SQL Error Code: 10060 
    // A network-related or instance-specific error occurred while establishing a connection to SQL Server. 
    // The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server 
    // is configured to allow remote connections. (provider: TCP Provider, error: 0 - A connection attempt failed 
    // because the connected party did not properly respond after a period of time, or established connection failed 
    // because connected host has failed to respond.)"} 
    case 10060: 
    // SQL Error Code: 40197 
    // The service has encountered an error processing your request. Please try again. 
    case 40197: 
    // SQL Error Code: 40540 
    // The service has encountered an error processing your request. Please try again. 
    case 40540: 
    // SQL Error Code: 40613 
    // Database XXXX on server YYYY is not currently available. Please retry the connection later. If the problem persists, contact customer 
    // support, and provide them the session tracing ID of ZZZZZ. 
    case 40613: 
    // SQL Error Code: 40143 
    // The service has encountered an error processing your request. Please try again. 
    case 40143: 
    // SQL Error Code: 233 
    // The client was unable to establish a connection because of an error during connection initialization process before login. 
    // Possible causes include the following: the client tried to connect to an unsupported version of SQL Server; the server was too busy 
    // to accept new connections; or there was a resource limitation (insufficient memory or maximum allowed connections) on the server. 
    // (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.) 
    case 233: 
    // SQL Error Code: 64 
    // A connection was successfully established with the server, but then an error occurred during the login process. 
    // (provider: TCP Provider, error: 0 - The specified network name is no longer available.) 
    case 64: 
    // DBNETLIB Error Code: 20 
    // The instance of SQL Server you attempted to connect to does not support encryption. 
    case (int)ProcessNetLibErrorCode.EncryptionNotSupported: 
     return true; 
} 

See the complete source here

+1

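This is good. Just adding a note that this list is still incomplete. EF are as clueless as anyone else. They forgot items. Example: snapshot isolation write conflicts. Also, I don't see deadlocks on that list, which makes the value of the list very questionable. – usr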

+0

Yes, I agree! But it's a good list. – vendettamit

+0

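Thanks @vendettamit - it looks good, but as usr mentioned, the fact that it misses deadlocks makes me a bit suspicious, since that's probably one of the main reasons I'd want to retry. I guess maybe that's not something EF can decide to retry automatically? – thab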