2017-03-15

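How to tell from an Npgsql exception whether the call is worth retrying (transient fault strategy)

I'm writing a service that will connect to a remote Postgres server. I'm looking for a good way to decide which exceptions should be treated as transient (worth retrying), and how to define an appropriate policy for connecting to a remote database.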

The service uses Npgsql for data access. The documentation says Npgsql throws a PostgresException for SQL errors and an NpgsqlException for "server-related issues".

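So far, the best approach I can come up with is to assume that every exception that is not a PostgresException may be transient and is worth retrying, whereas a PostgresException means something is wrong with the query itself, so a retry won't help. Is this assumption correct?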

I'm using Polly to build retry and circuit-breaker policies, so my policy looks like this:

    private static Policy postgresTransientPolicy;

    public static Policy PostgresTransientFaultPolicy
    {
        get
        {
            // Build the policy once and cache it so all callers share the same circuit-breaker state.
            return postgresTransientPolicy ?? (postgresTransientPolicy = Policy
                .Handle<Exception>(AllButPostgresExceptions()) // a PostgresException won't succeed on retry, so don't retry it
                .WaitAndRetryAsync(new[]
                {
                    TimeSpan.FromSeconds(1),
                    TimeSpan.FromSeconds(2),
                    TimeSpan.FromSeconds(4)
                }, onRetry: (exception, span) => Log.Warning(exception, "Postgres retry failure"))
                .WrapAsync(
                    Policy.Handle<Exception>(AllButPostgresExceptions())
                        .AdvancedCircuitBreakerAsync(
                            failureThreshold: 0.7,
                            samplingDuration: TimeSpan.FromSeconds(30),
                            minimumThroughput: 20,
                            durationOfBreak: TimeSpan.FromSeconds(30),
                            onBreak: (ex, timeSpan, context) => Log.Warning(ex, "Postgres circuit breaker broken"),
                            onReset: context => Log.Warning("Postgres circuit breaker reset"),
                            onHalfOpen: () => Log.Warning("Postgres circuit breaker half-open"))));
        }
    }

    // Treat everything except a PostgresException (an error reported by the server) as potentially transient.
    private static Func<Exception, bool> AllButPostgresExceptions()
    {
        return ex => !(ex is PostgresException);
    }

Is there a better way to determine which errors are likely to be transient?

UPDATE:

Following the answer's suggestion, I opened a new issue on the Npgsql repo and updated my policy to look like this:

    private static Policy postgresTransientPolicy;

    public static Policy PostgresTransientFaultPolicy
    {
        get
        {
            // Built once and cached so the circuit breaker's state is shared across callers.
            return postgresTransientPolicy ?? (postgresTransientPolicy = Policy
                .Handle<Exception>(PostgresDatabaseTransientErrorDetectionStrategy())
                .WaitAndRetryAsync(
                    retryCount: 10,
                    sleepDurationProvider: retryAttempt => ExponentialBackoff(retryAttempt, 1.4),
                    onRetry: (exception, span) => Log.Warning(exception, "Postgres retry failure"))
                .WrapAsync(
                    Policy.Handle<Exception>(PostgresDatabaseTransientErrorDetectionStrategy())
                        .AdvancedCircuitBreakerAsync(
                            failureThreshold: 0.4,
                            samplingDuration: TimeSpan.FromSeconds(30),
                            minimumThroughput: 20,
                            durationOfBreak: TimeSpan.FromSeconds(30),
                            onBreak: (ex, timeSpan, context) => Log.Warning(ex, "Postgres circuit breaker broken"),
                            onReset: context => Log.Warning("Postgres circuit breaker reset"),
                            onHalfOpen: () => Log.Warning("Postgres circuit breaker half-open"))));
        }
    }

    private static TimeSpan ExponentialBackoff(int retryAttempt, double exponent)
    {
        // Note: retryAttempt^exponent grows polynomially (1s, ~2.6s, ~4.7s, ... ~25s at attempt 10),
        // not exponentially.
        // TODO: add a random 20% variance (jitter).
        return TimeSpan.FromSeconds(Math.Pow(retryAttempt, exponent));
    }

    private static Func<Exception, bool> PostgresDatabaseTransientErrorDetectionStrategy()
    {
        return ex =>
        {
            // If it is not a PostgresException, we must assume it may be transient.
            var pgex = ex as PostgresException;
            if (pgex == null)
                return true;

            switch (pgex.SqlState)
            {
                case "53000": // insufficient_resources
                case "53100": // disk_full
                case "53200": // out_of_memory
                case "53300": // too_many_connections
                case "53400": // configuration_limit_exceeded
                case "57P03": // cannot_connect_now
                case "58000": // system_error
                case "58030": // io_error

                // I'm not sure whether these should be treated as transient, but I'm guessing so.
                case "55P03": // lock_not_available
                case "55006": // object_in_use
                case "55000": // object_not_in_prerequisite_state
                case "08000": // connection_exception
                case "08003": // connection_does_not_exist
                case "08006": // connection_failure
                case "08001": // sqlclient_unable_to_establish_sqlconnection
                case "08004": // sqlserver_rejected_establishment_of_sqlconnection
                case "08007": // transaction_resolution_unknown
                    return true;
            }

            return false;
        };
    }
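The TODO in `ExponentialBackoff` (a random 20% variance) could be handled roughly as follows. This is a minimal sketch under stated assumptions, not part of Polly or Npgsql: `Backoff.ExponentialWithJitter` and its parameters are hypothetical names, and it doubles the delay from an initial value on each attempt (true exponential growth) instead of using the `retryAttempt^1.4` curve.

```csharp
using System;

// Hypothetical helper, not part of Polly or Npgsql.
public static class Backoff
{
    private static readonly Random Rng = new Random();

    // Returns initialDelay * 2^(retryAttempt - 1), capped at maxDelay,
    // then scaled by a random factor in [1 - jitterFraction, 1 + jitterFraction).
    public static TimeSpan ExponentialWithJitter(
        int retryAttempt, TimeSpan initialDelay, TimeSpan maxDelay, double jitterFraction = 0.2)
    {
        if (retryAttempt < 1)
            throw new ArgumentOutOfRangeException(nameof(retryAttempt));

        // Delay doubles on each attempt: 1x, 2x, 4x, 8x, ... of initialDelay.
        double baseMs = initialDelay.TotalMilliseconds * Math.Pow(2, retryAttempt - 1);

        // Cap the un-jittered delay so late attempts don't wait unboundedly long.
        baseMs = Math.Min(baseMs, maxDelay.TotalMilliseconds);

        // System.Random is not thread-safe, so serialize access to the shared instance.
        double sample;
        lock (Rng)
        {
            sample = Rng.NextDouble(); // in [0, 1)
        }

        double factor = 1 + jitterFraction * (2 * sample - 1);
        return TimeSpan.FromMilliseconds(baseMs * factor);
    }
}
```

It would plug into the existing policy as `sleepDurationProvider: retryAttempt => Backoff.ExponentialWithJitter(retryAttempt, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(30))`. The jitter spreads out retries from many clients so they don't all hit a recovering server at the same moment.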

Answer

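Your approach is good. An NpgsqlException usually means a network/IO error, although to be sure you can inspect the inner exception and check whether it is an IOException.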
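A minimal sketch of the inner-exception check suggested here. `ExceptionInspection.IsNetworkFailure` is a hypothetical helper; it walks the whole `InnerException` chain rather than only the first level, and it also counts `SocketException` as a network failure, which is an assumption beyond what the answer states.

```csharp
using System;
using System.IO;
using System.Net.Sockets;

// Hypothetical helper for classifying failures that are not PostgresExceptions.
public static class ExceptionInspection
{
    // Returns true if ex, or any exception in its InnerException chain,
    // is an IOException or a SocketException (typically a network-level failure).
    public static bool IsNetworkFailure(Exception ex)
    {
        for (Exception current = ex; current != null; current = current.InnerException)
        {
            if (current is IOException || current is SocketException)
                return true;
        }
        return false;
    }
}
```

It could be folded into the `Handle<Exception>(...)` predicate, e.g. `ex => !(ex is PostgresException) && ExceptionInspection.IsNetworkFailure(ex)`, so that only failures that look like network/IO problems are retried.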

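A PostgresException is thrown when PostgreSQL itself reports an error, which in most cases means a problem with the query. However, there can be temporary server-side problems (e.g. too many connections); for those you can check the SQL error code. See the PG docs.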
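PostgreSQL SQLSTATE codes are five characters long, and the first two characters give the error class (for example `08` connection exception, `53` insufficient resources, `58` system error). The sketch below checks whole classes plus a few individual codes. Which classes and codes count as transient here is an assumption, not an official list, and `SqlStateClassifier` is a hypothetical name; with Npgsql it would be called with `PostgresException.SqlState`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical classifier; the transient sets below are a judgment call, not an official list.
public static class SqlStateClassifier
{
    // Whole SQLSTATE classes (first two characters) treated as transient.
    private static readonly HashSet<string> TransientClasses = new HashSet<string>
    {
        "08", // connection_exception
        "53", // insufficient_resources
        "58", // system_error (errors external to PostgreSQL itself)
    };

    // Individual codes from other classes treated as transient.
    private static readonly HashSet<string> TransientCodes = new HashSet<string>
    {
        "40001", // serialization_failure
        "40P01", // deadlock_detected
        "57P03", // cannot_connect_now
    };

    public static bool IsTransient(string sqlState)
    {
        // SQLSTATE codes are always exactly five characters.
        if (sqlState == null || sqlState.Length != 5)
            return false;

        return TransientCodes.Contains(sqlState)
            || TransientClasses.Contains(sqlState.Substring(0, 2));
    }
}
```

Class `40` is deliberately not matched as a whole: it also contains codes such as `40002` (transaction_integrity_constraint_violation) that a retry would not fix.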

It might be a good idea to add an IsTransient property to these exceptions, encoding these checks in Npgsql itself. You're welcome to open an issue for this in the Npgsql repo.

Thanks for the help and the suggestion. I'll update my policy to check for some specific error codes and then post the revised policy here.

Great. Please open an issue at https://github.com/npgsql/npgsql and we can incorporate your policy into Npgsql itself.

In my experience, the most important transient error is code 40001 (serialization_failure).
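One caveat with `40001`: PostgreSQL has already rolled back the transaction, so retrying only the failing statement won't work; the whole transaction has to be re-run from the start. A minimal sketch, with `TransactionRetry.Run` and its `getSqlState` parameter as hypothetical names so the example does not depend on Npgsql; with Npgsql one would pass something like `ex => (ex as PostgresException)?.SqlState`.

```csharp
using System;

// Hypothetical helper: re-runs a complete transaction on serialization failure.
public static class TransactionRetry
{
    private const string SerializationFailure = "40001";

    // `transaction` must perform the whole unit of work (BEGIN ... COMMIT) each time it is called,
    // because PostgreSQL has already rolled back the transaction that failed.
    // `getSqlState` extracts the SQLSTATE from an exception, or returns null if it has none.
    public static T Run<T>(Func<T> transaction, Func<Exception, string> getSqlState, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return transaction();
            }
            catch (Exception ex) when (attempt < maxAttempts && getSqlState(ex) == SerializationFailure)
            {
                // Swallow and loop to retry; any other error, or the final failed attempt, propagates.
            }
        }
    }
}
```

Keeping this loop around the whole transaction, separate from a per-command Polly policy, avoids re-issuing a single statement inside a transaction the server has already aborted.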