2016-09-19 168 views
1

Pandas: error when looping over the rows of a given pandas DataFrame

I have the following code:

df_boundry = df_in.dropna().quantile([0.0, .8])
for row in df_in.iterrows():
    for column in row:
        if row[column] > df_boundry[column][0.8]:
            row[column] = df_boundry[column][0.8]

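Basically, for each row (record), we check the value in every column. If the value exceeds the 80th percentile, we replace it with the 80th-percentile value. But the code above raises this error: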

--------------------------------------------------------------------------- 
KeyError         Traceback (most recent call last) 
<ipython-input-67-81b2be77cc8a> in <module>() 
     4 for row in df_in.iterrows(): 
     5  for column in row: 
----> 6   if row[column] > df_boundry[column][0.8]: 
     7    row[column] = df_boundry[column][0.8] 
     8 

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key) 
    1995    return self._getitem_multilevel(key) 
    1996   else: 
-> 1997    return self._getitem_column(key) 
    1998 
    1999  def _getitem_column(self, key): 

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key) 
    2002   # get column 
    2003   if self.columns.is_unique: 
-> 2004    return self._get_item_cache(key) 
    2005 
    2006   # duplicate columns & possible reduce dimensionality 

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item) 
    1348   res = cache.get(item) 
    1349   if res is None: 
-> 1350    values = self._data.get(item) 
    1351    res = self._box_item_values(item, values) 
    1352    cache[item] = res 

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath) 
    3288 
    3289    if not isnull(item): 
-> 3290     loc = self.items.get_loc(item) 
    3291    else: 
    3292     indexer = np.arange(len(self.items))[isnull(self.items)] 

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance) 
    1945     return self._engine.get_loc(key) 
    1946    except KeyError: 
-> 1947     return self._engine.get_loc(self._maybe_cast_indexer(key)) 
    1948 
    1949   indexer = self.get_indexer([key], method=method, tolerance=tolerance) 

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)() 

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)() 

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)() 

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)() 

KeyError: 0 

Here is some sample data from df_in:

column_A | column_B | column_C 
    -------------------------------- 
    0.5  | 0.5 | NaN 
    1.2  | NaN | NaN 
    NaN  | 8.1 | 21.1 
    9.1  | 9.3 | 2.1 
    4.5  | 90.1 | 1.4 
    112.3  | 79.2 | 1.1 
     : 
     : 

And df_boundry:

| column_A | column_B | column_C 
---------------------------------------- 
0.0 |  0.1 | 0.4  | 0.0 
0.8 | 110.4 | 80.1  | 20.5 

The expected result for the sample data would be:

column_A | column_B | column_C 
    -------------------------------- 
    0.5  | 0.5 | NaN 
    1.2  | NaN | NaN 
    NaN  | 8.1 | 20.5 
    9.1  | 9.3 | 2.1 
    4.5  | 80.1 | 1.4 
    110.4  | 79.2 | 1.1 
     : 
     : 

That is, only when a cell value > df_boundry[column][0.8] do we replace it with df_boundry[column][0.8].

Does anyone know what I am missing here? Thanks!

+0

Can you post a sample data set (5-7 rows)? – MaxU

+0

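Just so you understand the error: df_in.iterrows() returns a tuple of (index, row). You can fix that by writing 'for idx, row in df_in.iterrows()', but even after you do that, row is a Series, so 'for column in row' actually yields each value in the row rather than the column names. Try printing some variables inside the loop to explore it further. – shawnheide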
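The two points in the comment above can be sketched as a corrected loop. This is a minimal sketch, not the asker's final code: it assumes the sample df_in and df_boundry values posted in the question, and the name df_out is introduced here only for illustration. It unpacks the (index, row) tuple, iterates over row.index to get column labels, and writes to a copy of the frame, because the row Series yielded by iterrows() is generally a copy and assigning to it does not change the DataFrame.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df_in = pd.DataFrame({
    "column_A": [0.5, 1.2, np.nan, 9.1, 4.5, 112.3],
    "column_B": [0.5, np.nan, 8.1, 9.3, 90.1, 79.2],
    "column_C": [np.nan, np.nan, 21.1, 2.1, 1.4, 1.1],
})

# Boundary values copied from the question (rows are the 0.0 and 0.8 quantiles).
df_boundry = pd.DataFrame(
    {"column_A": [0.1, 110.4], "column_B": [0.4, 80.1], "column_C": [0.0, 20.5]},
    index=[0.0, 0.8],
)

df_out = df_in.copy()  # hypothetical name; results are written here
for idx, row in df_in.iterrows():      # iterrows() yields (index, Series)
    for column in row.index:           # row.index holds the column labels
        upper = df_boundry.loc[0.8, column]
        # Comparisons with NaN are False, so missing values are left alone.
        if row[column] > upper:
            df_out.at[idx, column] = upper

print(df_out)
```

A row-by-row loop like this is mainly useful for understanding the error; on large frames a vectorized approach is usually much faster.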

Answers

2

UPDATE2:

In [7]: df_boundry 
Out[7]: 
    column_A column_B column_C 
0.0  0.1  0.4  0.0 
0.8  110.4  80.1  20.5 

In [8]: df_boundry.iloc[-1] 
Out[8]: 
column_A 110.4 
column_B  80.1 
column_C  20.5 
Name: 0.8, dtype: float64 

In [9]: df_boundry.iloc[[-1]] 
Out[9]: 
    column_A column_B column_C 
0.8  110.4  80.1  20.5 

UPDATE: still using the same old code, with the updated DF:

In [373]: df_boundry 
Out[373]: 
    column_A column_B column_C 
0.0  0.1  0.4  0.0 
0.8  110.4  80.1  20.5 

In [374]: df 
Out[374]: 
    column_A column_B column_C 
0  0.5  0.5  NaN 
1  1.2  NaN  NaN 
2  NaN  8.1  1.1 
3  9.1  9.3  2.1 
4  4.5  1.1  1.4 
5  112.3  79.2  1.1 

In [375]: sav = df.copy() 

In [376]: df[df > df_boundry.iloc[-1]] = pd.concat([df_boundry.iloc[[-1]]] * len(df)).set_index(df.index) 

In [377]: df 
Out[377]: 
    column_A column_B column_C 
0  0.5  0.5  NaN 
1  1.2  NaN  NaN 
2  NaN  8.1  1.1 
3  9.1  9.3  2.1 
4  4.5  1.1  1.4 
5  110.4  79.2  1.1 

OLD answer:

You can do this in a vectorized way:

In [350]: df 
Out[350]: 
    column_A column_B column_C 
0  0.5  0.5  NaN 
1  1.2  NaN  NaN 
2  NaN  8.1  1.1 
3  9.1  9.3  2.1 
4  4.5  1.1  1.4 

In [351]: df_boundry = df.dropna().quantile([0.0, .8]) 

In [352]: df_boundry 
Out[352]: 
    column_A column_B column_C 
0.0  4.50  1.10  1.40 
0.8  8.18  7.66  1.96 

In [353]: df[df > df_boundry.iloc[-1]] = pd.concat([df_boundry.iloc[[-1]]] * len(df)).set_index(df.index) 

In [354]: df 
Out[354]: 
    column_A column_B column_C 
0  0.50  0.50  NaN 
1  1.20  NaN  NaN 
2  NaN  7.66  1.10 
3  8.18  7.66  1.96 
4  4.50  1.10  1.40 

Explanation:

In [365]: df > df_boundry.iloc[-1] 
Out[365]: 
    column_A column_B column_C 
0 False False False 
1 False False False 
2 False  True False 
3  True  True  True 
4 False False False 

In [356]: df_boundry.iloc[[-1]] 
Out[356]: 
    column_A column_B column_C 
0.8  8.18  7.66  1.96 

In [357]: pd.concat([df_boundry.iloc[[-1]]] * len(df)) 
Out[357]: 
    column_A column_B column_C 
0.8  8.18  7.66  1.96 
0.8  8.18  7.66  1.96 
0.8  8.18  7.66  1.96 
0.8  8.18  7.66  1.96 
0.8  8.18  7.66  1.96 

In [358]: pd.concat([df_boundry.iloc[[-1]]] * len(df)).set_index(df.index) 
Out[358]: 
    column_A column_B column_C 
0  8.18  7.66  1.96 
1  8.18  7.66  1.96 
2  8.18  7.66  1.96 
3  8.18  7.66  1.96 
4  8.18  7.66  1.96 
+0

I've revised my question; sorry, I may not have been clear before. We replace a cell only when its value > df_boundry[column][0.8]. Thanks! – Edamame

+1

@Edamame, that is exactly what my code is doing. 'df_boundry.iloc[-1]' gives you the upper boundary. I've added the output of 'df > df_boundry.iloc[-1]' to the explanation section – MaxU

+1

@Edamame, of course, you can use 'df_boundry.ix[[0.8]]' instead of 'df_boundry.iloc[[-1]]' if you want... – MaxU
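As a side note to the comments above, here is a hedged sketch of selecting the boundary row by its label and capping the values in one step. It uses DataFrame.clip, which is a different technique from the concat/set_index assignment in the answer, and .loc rather than .ix, since .ix has been removed from current pandas releases. The data is the updated frame from the answer; the names upper and capped are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Updated data from the answer above.
df = pd.DataFrame({
    "column_A": [0.5, 1.2, np.nan, 9.1, 4.5, 112.3],
    "column_B": [0.5, np.nan, 8.1, 9.3, 1.1, 79.2],
    "column_C": [np.nan, np.nan, 1.1, 2.1, 1.4, 1.1],
})
df_boundry = pd.DataFrame(
    {"column_A": [0.1, 110.4], "column_B": [0.4, 80.1], "column_C": [0.0, 20.5]},
    index=[0.0, 0.8],
)

# Select the 0.8 row by label; the result is a Series indexed by column name.
upper = df_boundry.loc[0.8]

# axis=1 aligns the Series with the columns, so each column gets its own cap.
# NaN cells stay NaN, and the original df is not modified.
capped = df.clip(upper=upper, axis=1)

print(capped)
```

Only the 112.3 in column_A exceeds its boundary here, so it is the only value that changes (to 110.4).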

1

Instead of computing all the quantiles up front, you can use the DataFrame's apply method and operate on each column separately.

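def fill_with_quantile(col, q=0.80):
    # Cap every value in the column at the column's q-th quantile.
    q_value = col.dropna().quantile(q)
    col = col.copy()  # work on a copy so the caller's data is not changed in place
    col[col > q_value] = q_value
    return col

# apply returns a new DataFrame, so keep the result
df_out = df_in.apply(lambda col: fill_with_quantile(col, 0.8), axis=0)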

If you like, you can also change the fill_with_quantile function to cap both the lower and the upper extremes (e.g. at 0.2 and 0.8).
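One possible shape for that two-sided variant is sketched below. The function name clip_to_quantiles and its parameters are hypothetical, not part of the answer, and the sample data is the df_in from the question. Note that each quantile is computed per column over that column's non-missing values, which differs from the question's df_in.dropna().quantile(...), where any row containing a NaN is dropped first.

```python
import numpy as np
import pandas as pd

def clip_to_quantiles(col, lower_q=0.2, upper_q=0.8):
    """Hypothetical helper: cap a column at its lower and upper quantiles."""
    low = col.quantile(lower_q)    # Series.quantile ignores NaN values
    high = col.quantile(upper_q)
    # clip returns a new Series; NaN entries are left as NaN.
    return col.clip(lower=low, upper=high)

# Sample data copied from the question.
df_in = pd.DataFrame({
    "column_A": [0.5, 1.2, np.nan, 9.1, 4.5, 112.3],
    "column_B": [0.5, np.nan, 8.1, 9.3, 90.1, 79.2],
    "column_C": [np.nan, np.nan, 21.1, 2.1, 1.4, 1.1],
})

# Each column is passed to the helper as a Series.
df_out = df_in.apply(clip_to_quantiles, axis=0)
print(df_out)
```

Passing lower_q=0.0 reproduces the one-sided behaviour, since nothing lies below a column's minimum.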