2013-02-25 31 views
4

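Speeding up an outlier check on a pandas Series

I am running an outlier check on a pandas Series object in two passes, each with a different standard-deviation threshold. However, I am using two loops for this and it runs very slowly. I wonder whether there is any pandas "trick" to speed up this step.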

Here is the code I am using (warning: really ugly code!):

import numpy as np
from pandas import Series

def find_outlier(point, window, n):
    # True when `point` is at least n standard deviations from the window mean (NaNs ignored).
    return np.abs(point - np.nanmean(window)) >= n * np.nanstd(window)

def despike(self, std1=2, std2=20, block=100, keep=0):
    res = self.values.copy()
    # First run with std1:
    for k, point in enumerate(res):
        if k <= block:
            window = res[k:k + block]
        elif k >= len(res) - block:
            window = res[k - block:k]
        else:
            window = res[k - block:k + block]
        window = window[~np.isnan(window)]
        if np.abs(point - window.mean()) >= std1 * window.std():
            res[k] = np.nan
    # Second run with std2:
    for k, point in enumerate(res):
        if k <= block:
            window = res[k:k + block]
        elif k >= len(res) - block:
            window = res[k - block:k]
        else:
            window = res[k - block:k + block]
        window = window[~np.isnan(window)]
        if np.abs(point - window.mean()) >= std2 * window.std():
            res[k] = np.nan
    return Series(res, index=self.index, name=self.name)

Answers

12

I'm not sure what you are doing with the block piece, but finding outliers in a Series should be as simple as:

In [1]: s > s.std() * 3 

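where s is your Series and 3 is the number of standard deviations a value must exceed to count as an outlier. This expression returns a boolean Series, which you can then use to index the original Series: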

In [2]: s.head(10) 
Out[2]: 
0 1.181462 
1 -0.112049 
2 0.864603 
3 -0.220569 
4 1.985747 
5 4.000000 
6 -0.632631 
7 -0.397940 
8 0.881585 
9 0.484691 
Name: val 

In [3]: s[s > s.std() * 3] 
Out[3]: 
5 4 
Name: val 
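As written, the expression compares the raw values with the threshold, so it only flags values far above zero. A minimal reproducible sketch of the same boolean indexing, next to a two-sided variant that measures the distance from the mean, might look like this. The data are made up for illustration: a sine wave with two injected spikes.

```python
import numpy as np
import pandas as pd

# Made-up data: a sine wave with one positive and one negative spike.
values = np.sin(np.arange(200, dtype=float))
values[50] = 10.0
values[120] = -10.0
s = pd.Series(values, name="val")

threshold = s.std() * 3

# One-sided check from the answer: only large positive values pass.
one_sided = s[s > threshold]

# Two-sided check: distance from the mean, so negative spikes are caught too.
two_sided = s[(s - s.mean()).abs() > threshold]
```

Both results are ordinary Series, so their index gives the positions of the flagged points.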

UPDATE:

Addressing the comment about the block: I think you can use pd.rolling_std() in this case.

In [53]: pd.rolling_std(s, window=5).head(10) 
Out[53]: 
0   NaN 
1   NaN 
2   NaN 
3   NaN 
4 0.871541 
5 0.925348 
6 0.920313 
7 0.370928 
8 0.467932 
9 0.391485 

In [55]: abs(s) > pd.rolling_std(s, window=5) * 3 

Docstring: 
Unbiased moving standard deviation 

Parameters 
---------- 
arg : Series, DataFrame 
window : Number of observations used for calculating statistic 
min_periods : int 
    Minimum number of observations in window required to have a value 
freq : None or string alias/date offset object, default=None 
    Frequency to conform to before computing statistic 
    time_rule is a legacy alias for freq 

Returns 
------- 
y : type of input argument 
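The module-level pd.rolling_std function was deprecated in pandas 0.18 and removed in later releases; current versions spell the same computation as a method chain on the Series. A minimal sketch of the equivalent calls, using made-up data:

```python
import numpy as np
import pandas as pd

# Made-up data: a sine wave with one injected spike.
values = np.sin(np.arange(200, dtype=float))
values[50] = 10.0
s = pd.Series(values, name="val")

# Modern spelling of pd.rolling_std(s, window=5): a trailing 5-point window.
# The first four results are NaN because the window is not yet full.
rolling_std = s.rolling(window=5).std()

# Same test as In [55]. Comparisons against NaN evaluate to False,
# so the first four points are never flagged.
mask = s.abs() > rolling_std * 3
```

Passing min_periods to rolling() lowers the number of observations needed before a value is produced, which matches the min_periods parameter in the docstring above.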
+0

Hi Zelazny7. It's because I need to compare each point only with the 100 points around it, not with the whole series. That is why I need the loop. – ocefpaf 2013-02-25 20:13:20

+0

Thanks, that is exactly what I needed. – ocefpaf 2013-02-26 15:16:48

+6

Note that this solution assumes the data are centered on zero. A slightly more accurate answer: abs(s - s.mean()) > pd.rolling_std(s, window=5) * 3 – MarkAWard 2014-07-15 16:52:35
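Putting the rolling approach and the centering remark together, the two-pass despike from the question can be sketched without Python loops. This is an illustration under stated assumptions, not the asker's exact algorithm: it uses the current Series.rolling API, a centered window of up to `block` points on each side (truncated at the edges, where the original switches to one-sided windows), a rolling rather than global mean, and a strict `>` so that constant windows are not flagged. The function name and parameter defaults are taken from the question; the test data are made up.

```python
import numpy as np
import pandas as pd


def despike(series, std1=2, std2=20, block=100):
    """Mask outliers in two passes with rolling statistics (a sketch)."""
    res = series.astype(float)
    for n_std in (std1, std2):
        # Centered window; min_periods=1 keeps the truncated edge windows.
        # Rolling mean/std skip NaNs, including those set by the first pass.
        roll = res.rolling(window=2 * block + 1, center=True, min_periods=1)
        # ddof=0 mirrors the population std that ndarray.std() computes.
        outlier = (res - roll.mean()).abs() > n_std * roll.std(ddof=0)
        # Replace flagged points with NaN; everything else is kept.
        res = res.mask(outlier)
    return res


# Made-up data: a sine wave with one injected spike.
values = np.sin(np.arange(400, dtype=float))
values[50] = 10.0
s = pd.Series(values, name="val")

cleaned = despike(s)
```

Because each pass is a handful of vectorized operations over the whole Series, the per-point Python loop disappears, which is where most of the running time in the original went.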