2015-05-22

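scipy.stats.mstats.theilslopes: confidence limits wrong if data has missing values

If the scipy.stats.mstats.theilslopes routine is used on a data set with missing values, the lower and upper bounds of the slope estimate are incorrect. The upper bound is often/always (?) NaN, while the lower bound is simply wrong. This happens because the theilslopes routine computes indices into the sorted array of slopes, and that array contains the slopes that involve missing values.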
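
To make the mechanism concrete, here is a small NumPy-only sketch. It is an illustration of the idea, not SciPy's actual implementation: it builds all pairwise slopes for the example data used in this question and shows that every pair touching a NaN produces a NaN slope, which np.sort then places at the end of the sorted array, exactly where an upper-bound index would land.

```python
import numpy as np

x = np.arange(12)
y = np.array([28.9, 26.2, 27.2, 26.5, 28.4, 25.3, 26.1, 24.8, 27.7,
              np.nan, np.nan, 29.6])

# All pairwise slopes (y_j - y_i) / (x_j - x_i) for i < j.
i, j = np.triu_indices(len(x), k=1)
slopes = (y[j] - y[i]) / (x[j] - x[i])

# Any pair that involves a NaN observation yields a NaN slope.
n_nan = int(np.isnan(slopes).sum())

# np.sort places NaN values at the end of the sorted array.
sorted_slopes = np.sort(slopes)

print(n_nan, "of", slopes.size, "pairwise slopes are NaN")  # 21 of 66
print("largest sorted entries:", sorted_slopes[-3:])        # all nan
```

Only 45 of the 66 slopes are valid, so any index computed from the full length of 66 points at the wrong element or at a NaN.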

The workaround is to remove the missing values before the analysis, but this is not documented.

To illustrate the problem, here is a simple code snippet:

import numpy as np
from scipy.stats import mstats

x = np.arange(12) 
y = np.array([28.9, 26.2, 27.2, 26.5, 28.4, 25.3, 26.1, 24.8, 27.7, 
       np.nan, np.nan, 29.6]) 

slope, intercept, lo_slope, up_slope = mstats.theilslopes(y, x, 
                  alpha=0.1) 
print("incorrect:", slope, lo_slope, up_slope)

idx = [0, 1, 2, 3, 4, 5, 6, 7, 8, 11] 
x = x[idx] # equivalent to pandas series.dropna() 
y = y[idx] 

slope, intercept, lo_slope, up_slope = mstats.theilslopes(y, x, 
                  alpha=0.1) 
print("correct:", slope, lo_slope, up_slope)
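
The hard-coded index list can also be derived from the data. A minimal sketch, assuming NaN is the only marker for a missing observation (the name `keep` is just illustrative):

```python
import numpy as np
from scipy.stats import mstats

x = np.arange(12)
y = np.array([28.9, 26.2, 27.2, 26.5, 28.4, 25.3, 26.1, 24.8, 27.7,
              np.nan, np.nan, 29.6])

# Boolean mask of the observations that are present (assumes NaN marks a gap).
keep = ~np.isnan(y)

# Pass only the complete (x, y) pairs to theilslopes.
slope, intercept, lo_slope, up_slope = mstats.theilslopes(y[keep], x[keep],
                                                          alpha=0.1)
print("after dropping NaN:", slope, lo_slope, up_slope)
```

This selects the same ten points as the explicit `idx` list, without having to maintain the indices by hand.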

Answer

In the mstats module of scipy.stats, "missing values" are handled using a masked array. nan is not used to indicate a missing value.

Here is how you can convert your array y (which uses nan for missing values) into a masked array my:

In [48]: x = np.arange(12) 

In [49]: y = np.array([28.9, 26.2, 27.2, 26.5, 28.4, 25.3, 26.1, 24.8, 27.7, np.nan, np.nan, 29.6]) 

In [50]: my = np.ma.masked_array(y, mask=np.isnan(y)) 

In [51]: my 
Out[51]: 
masked_array(data = [28.9 26.2 27.2 26.5 28.4 25.3 26.1 24.8 27.7 -- -- 29.6], 
      mask = [False False False False False False False False False True True False], 
     fill_value = 1e+20) 

In [52]: slope, intercept, lo_slope, up_slope = mstats.theilslopes(my, x, alpha=0.1) 

In [53]: print "correct: ", slope, lo_slope, up_slope 
correct: -0.125 -0.48 0.3875 
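
As a shorter variant, np.ma.masked_invalid builds the masked array in one call (it masks NaN and infinite entries). The sketch below also compares the masked-array result with the result from dropping the NaN values by hand; on a SciPy version without the old bugs, the two are expected to agree. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import mstats

x = np.arange(12)
y = np.array([28.9, 26.2, 27.2, 26.5, 28.4, 25.3, 26.1, 24.8, 27.7,
              np.nan, np.nan, 29.6])

# Mask every NaN (and inf) entry of y in one step.
my = np.ma.masked_invalid(y)

# Theil-Sen estimate on the masked array.
masked_result = tuple(mstats.theilslopes(my, x, alpha=0.1))

# Same estimate after removing the NaN observations manually.
keep = ~np.isnan(y)
dropped_result = tuple(mstats.theilslopes(y[keep], x[keep], alpha=0.1))

print("masked: ", masked_result)
print("dropped:", dropped_result)
print("agree:  ", np.allclose(masked_result, dropped_result))
```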

By the way, make sure you are using at least version 0.15.0 of scipy. theilslopes in older versions had some bugs: https://github.com/scipy/scipy/pull/3574