Pandas: error when looping over the rows of a given pandas DataFrame

I have the following code:

df_boundry = df_in.dropna().quantile([0.0, .8])
for row in df_in.iterrows():
    for column in row:
        if row[column] > df_boundry[column][0.8]:
            row[column] = df_boundry[column][0.8]

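Basically, for each row (record), we check the value in every column. If the value exceeds the 80th percentile, we replace it with the 80th-percentile value. But the code above raises the following error: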

--------------------------------------------------------------------------- 
KeyError         Traceback (most recent call last) 
<ipython-input-67-81b2be77cc8a> in <module>() 
     4 for row in df_in.iterrows(): 
     5  for column in row: 
----> 6   if row[column] > df_boundry[column][0.8]: 
     7    row[column] = df_boundry[column][0.8] 
     8 

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key) 
    1995    return self._getitem_multilevel(key) 
    1996   else: 
-> 1997    return self._getitem_column(key) 
    1998 
    1999  def _getitem_column(self, key): 

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key) 
    2002   # get column 
    2003   if self.columns.is_unique: 
-> 2004    return self._get_item_cache(key) 
    2005 
    2006   # duplicate columns & possible reduce dimensionality 

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item) 
    1348   res = cache.get(item) 
    1349   if res is None: 
-> 1350    values = self._data.get(item) 
    1351    res = self._box_item_values(item, values) 
    1352    cache[item] = res 

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath) 
    3288 
    3289    if not isnull(item): 
-> 3290     loc = self.items.get_loc(item) 
    3291    else: 
    3292     indexer = np.arange(len(self.items))[isnull(self.items)] 

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance) 
    1945     return self._engine.get_loc(key) 
    1946    except KeyError: 
-> 1947     return self._engine.get_loc(self._maybe_cast_indexer(key)) 
    1948 
    1949   indexer = self.get_indexer([key], method=method, tolerance=tolerance) 

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)() 

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)() 

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)() 

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)() 

KeyError: 0

Here is some sample data for df_in:

column_A | column_B | column_C 
    -------------------------------- 
    0.5  | 0.5 | NaN 
    1.2  | NaN | NaN 
    NaN  | 8.1 | 21.1 
    9.1  | 9.3 | 2.1 
    4.5  | 90.1 | 1.4 
    112.3  | 79.2 | 1.1 
     : 
     :

And df_boundry:

| column_A | column_B | column_C 
---------------------------------------- 
0.0 |  0.1 | 0.4  | 0.0 
0.8 | 110.4 | 80.1  | 20.5

The expected result for the sample data would be:

column_A | column_B | column_C 
    -------------------------------- 
    0.5  | 0.5 | NaN 
    1.2  | NaN | NaN 
    NaN  | 8.1 | 20.5 
    9.1  | 9.3 | 2.1 
    4.5  | 80.1 | 1.4 
    110.4  | 79.2 | 1.1 
     : 
     :

That is, only when a cell value > df_boundry[column][0.8] do we replace it with df_boundry[column][0.8].

Does anyone know what I am missing here? Thanks!

Source

2016-09-19 Edamame

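Could you post a sample dataset (5-7 rows)? – MaxU

Just so you understand the error: df_in.iterrows() returns a tuple of (index, row). You can fix that by writing 'for idx, row in df_in.iterrows()', but even after you do that, row is a Series, so 'for column in row' actually iterates over each value in the row. Try printing some variables inside the loop to explore it further. – shawnheide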
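
To make the comment above concrete, here is a minimal sketch of what iterrows() yields and of one way to repair the loop. It also shows where `KeyError: 0` comes from: iterating over the (index, row) tuple gives the index label 0 first, so the original code ends up evaluating `df_boundry[0]`. The frame below is rebuilt from the sample data in the question, and `upper` is a hypothetical name holding the 0.8 row of df_boundry, with values copied from the question rather than recomputed.

```python
import numpy as np
import pandas as pd

# Small frame modeled on the sample data in the question
df_in = pd.DataFrame({
    "column_A": [0.5, 1.2, np.nan, 9.1, 4.5, 112.3],
    "column_B": [0.5, np.nan, 8.1, 9.3, 90.1, 79.2],
    "column_C": [np.nan, np.nan, 21.1, 2.1, 1.4, 1.1],
})
# Upper bounds copied from the df_boundry table in the question (assumed values)
upper = pd.Series({"column_A": 110.4, "column_B": 80.1, "column_C": 20.5})

# iterrows() yields (index, row) tuples, not bare rows
first = next(df_in.iterrows())

# Iterating a Series yields its values; the column labels live in row.index
values = list(first[1])
labels = list(first[1].index)

# Repaired loop: unpack the tuple, iterate over column labels, and write
# back through df_in.loc, because the yielded row may be a copy
for idx, row in df_in.iterrows():
    for column in df_in.columns:
        if row[column] > upper[column]:  # NaN > x is False, so NaN is left alone
            df_in.loc[idx, column] = upper[column]
```

This cell-by-cell loop is only meant to explain the error; on anything but a tiny frame, a vectorized approach is likely to be much faster.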

UPDATE2:

In [7]: df_boundry 
Out[7]: 
    column_A column_B column_C 
0.0  0.1  0.4  0.0 
0.8  110.4  80.1  20.5 

In [8]: df_boundry.iloc[-1] 
Out[8]: 
column_A 110.4 
column_B  80.1 
column_C  20.5 
Name: 0.8, dtype: float64 

In [9]: df_boundry.iloc[[-1]] 
Out[9]: 
    column_A column_B column_C 
0.8  110.4  80.1  20.5

UPDATE: still using the same old code, now with the updated DataFrame:

In [373]: df_boundry 
Out[373]: 
    column_A column_B column_C 
0.0  0.1  0.4  0.0 
0.8  110.4  80.1  20.5 

In [374]: df 
Out[374]: 
    column_A column_B column_C 
0  0.5  0.5  NaN 
1  1.2  NaN  NaN 
2  NaN  8.1  1.1 
3  9.1  9.3  2.1 
4  4.5  1.1  1.4 
5  112.3  79.2  1.1 

In [375]: sav = df.copy() 

In [376]: df[df > df_boundry.iloc[-1]] = pd.concat([df_boundry.iloc[[-1]]] * len(df)).set_index(df.index) 

In [377]: df 
Out[377]: 
    column_A column_B column_C 
0  0.5  0.5  NaN 
1  1.2  NaN  NaN 
2  NaN  8.1  1.1 
3  9.1  9.3  2.1 
4  4.5  1.1  1.4 
5  110.4  79.2  1.1

OLD answer:

You can do it in this (vectorized) way:

In [350]: df 
Out[350]: 
    column_A column_B column_C 
0  0.5  0.5  NaN 
1  1.2  NaN  NaN 
2  NaN  8.1  1.1 
3  9.1  9.3  2.1 
4  4.5  1.1  1.4 

In [351]: df_boundry = df.dropna().quantile([0.0, .8]) 

In [352]: df_boundry 
Out[352]: 
    column_A column_B column_C 
0.0  4.50  1.10  1.40 
0.8  8.18  7.66  1.96 

In [353]: df[df > df_boundry.iloc[-1]] = pd.concat([df_boundry.iloc[[-1]]] * len(df)).set_index(df.index) 

In [354]: df 
Out[354]: 
    column_A column_B column_C 
0  0.50  0.50  NaN 
1  1.20  NaN  NaN 
2  NaN  7.66  1.10 
3  8.18  7.66  1.96 
4  4.50  1.10  1.40

Explanation:

In [365]: df > df_boundry.iloc[-1] 
Out[365]: 
    column_A column_B column_C 
0 False False False 
1 False False False 
2 False  True False 
3  True  True  True 
4 False False False 

In [356]: df_boundry.iloc[[-1]] 
Out[356]: 
    column_A column_B column_C 
0.8  8.18  7.66  1.96 

In [357]: pd.concat([df_boundry.iloc[[-1]]] * len(df)) 
Out[357]: 
    column_A column_B column_C 
0.8  8.18  7.66  1.96 
0.8  8.18  7.66  1.96 
0.8  8.18  7.66  1.96 
0.8  8.18  7.66  1.96 
0.8  8.18  7.66  1.96 

In [358]: pd.concat([df_boundry.iloc[[-1]]] * len(df)).set_index(df.index) 
Out[358]: 
    column_A column_B column_C 
0  8.18  7.66  1.96 
1  8.18  7.66  1.96 
2  8.18  7.66  1.96 
3  8.18  7.66  1.96 
4  8.18  7.66  1.96
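
A more compact way to get the same capping, offered here as a sketch rather than as something taken from the answer: `DataFrame.clip` accepts a Series as `upper`, and with `axis=1` it aligns that Series with the column labels. The frames below are rebuilt from the UPDATE section, and `capped` is a hypothetical name for the result.

```python
import numpy as np
import pandas as pd

# Data copied from the UPDATE section of the answer
df = pd.DataFrame({
    "column_A": [0.5, 1.2, np.nan, 9.1, 4.5, 112.3],
    "column_B": [0.5, np.nan, 8.1, 9.3, 1.1, 79.2],
    "column_C": [np.nan, np.nan, 1.1, 2.1, 1.4, 1.1],
})
df_boundry = pd.DataFrame(
    {"column_A": [0.1, 110.4], "column_B": [0.4, 80.1], "column_C": [0.0, 20.5]},
    index=[0.0, 0.8],
)

# Select the 0.8 row by label; this is a Series indexed by column name
upper = df_boundry.loc[0.8]

# Cap every column at its own upper bound; NaN cells stay NaN,
# and df itself is not modified
capped = df.clip(upper=upper, axis=1)
```

Because clip returns a new frame, there is no need to keep a separate copy of the original before capping.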

Source

2016-09-19 22:17:37 MaxU

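I revised my question; sorry, I may not have been clear before. We only replace a cell when its value > df_boundry[column][0.8]. Thanks! – Edamame

@Edamame, that is exactly what my code is doing. 'df_boundry.iloc[-1]' gives you the upper boundary. I have added the output of 'df > df_boundry.iloc[-1]' to the explanation section – MaxU

@Edamame, of course, you can use 'df_boundry.ix[[0.8]]' instead of 'df_boundry.iloc[[-1]]' if you want... – MaxU

Instead of computing all the quantiles up front, you can use the DataFrame's apply method and operate on each column separately.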

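def fill_with_quantile(col, q=0.80):
    # Quantile of the non-missing values in this column
    q_value = col.dropna().quantile(q)
    col = col.copy()  # work on a copy so the caller's column is not mutated
    col[col > q_value] = q_value
    return col

# apply() returns a new DataFrame, so assign the result
df_in = df_in.apply(lambda col: fill_with_quantile(col, 0.8), axis=0)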

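If you like, you can also change the fill_with_quantile function to cap both the lower and the upper extremes (e.g. at the 0.2 and 0.8 quantiles).

Source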
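
A two-sided version might look like the sketch below. The function name `clip_to_quantiles` and its parameters are hypothetical, and the small frame is made up for illustration; it relies on `Series.quantile` skipping NaN by default and on `Series.clip` leaving NaN cells untouched.

```python
import numpy as np
import pandas as pd

def clip_to_quantiles(col, lower_q=0.2, upper_q=0.8):
    """Cap a column at its own lower and upper quantiles (hypothetical helper)."""
    lower = col.quantile(lower_q)  # NaN values are skipped by default
    upper = col.quantile(upper_q)
    # clip returns a new Series; values outside [lower, upper] are pulled in
    return col.clip(lower=lower, upper=upper)

# Made-up example column, including one missing value
df_in = pd.DataFrame({"column_A": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan]})

# Each column is processed independently
df_out = df_in.apply(clip_to_quantiles)
```

Here the 0.2 and 0.8 quantiles of the five non-missing values are 1.8 and 4.2, so 1.0 becomes 1.8 and 5.0 becomes 4.2, while the NaN stays NaN.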

2016-09-19 23:19:40 shawnheide
