2016-02-18 64 views

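Removing outliers before aggregating in Python pandas

I have a DataFrame with 5 columns. I group by the first 4 columns and compute the mean, standard deviation, and count of the 5th column.

I do this with the following code: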

df.groupby(['col1','col2','col3','col4']).agg([np.mean, np.std, len]) 

Now my question: I have a function that replaces outliers with the mean. How can I drop the outlier rows instead?

def replace(group): 
    mean, std = group.mean(), group.std() 
    outliers = (group - mean).abs() > 3*std 
    group[outliers] = mean   
    return group 

df.groupby(['col1','col2','col3','col4']).transform(replace) 

Second question:

When I try to combine transform and agg, I get the following error:

df.groupby(['col1','col2','col3','col4']).transform(replace).agg([np.mean, np.std, len]) 

AttributeError: 'DataFrame' object has no attribute 'agg' 

I think you can use 'df.groupby(['col1','col2','col3','col4']).apply(replace)' instead of 'df.groupby(['col1','col2','col3','col4']).transform(replace)' – jezrael

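The error is clear: you are taking the result returned by the initial 'transform' and calling 'agg' on it, but 'agg' is only available on 'groupby' objects. For your first step, couldn't you just mark each value as True/False depending on whether it is an outlier, and then filter on that as a post-processing step? – EdChum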

@EdChum Would this work: 'group.drop(group[outliers], inplace=True)'? – Tasos

Answer

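The DataFrame returned by transform() has no agg() method, so you need to call groupby() again. Alternatively, you can save the groupby object and reuse its grouper attribute.

To drop the outliers, call apply() to get a boolean Series mask, then select the rows with it and call groupby() again: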

import pandas as pd 
import numpy as np 

N = 10000 
df = pd.DataFrame(np.random.randint(0, 5, size=(N, 4)), columns=["c1", "c2", "c3", "c4"]) 
df["c5"] = np.random.randn(N) 

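def replace(group): 
    # Replace values more than 2 standard deviations from the group mean with that mean.
    mean, std = group.mean(), group.std() 
    inliers = (group - mean).abs() <= 2*std 
    return group.where(inliers, mean) 

def drop(group): 
    # Return a boolean mask: True for values within 2 standard deviations of the group mean.
    mean, std = group.mean(), group.std() 
    inliers = (group - mean).abs() <= 2*std 
    return inliers 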

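# Keep the groupby object so it can be reused.
g = df.groupby(['c1','c2','c3','c4']) 

# Replace outliers in c5, then regroup the result with the same grouper and aggregate.
s1 = g.c5.transform(replace) 
res1 = s1.groupby(g.grouper).agg([np.mean, np.std, len]) 

# Build a boolean mask of inliers, keep only those rows, then group and aggregate again.
mask = g.c5.apply(drop) 
res2 = df[mask].groupby(['c1','c2','c3','c4']).c5.agg([np.mean, np.std, len]) 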
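
The way apply() indexes its result and the availability of the grouper attribute can differ between pandas versions, so the mask step may be easier to express with transform(), which returns a result aligned with the original rows. A minimal sketch of that variant, assuming a small hypothetical frame (the names toy, k, v and is_inlier are illustrative, not part of the question) and the same 2-standard-deviation threshold:

```python
import pandas as pd

# Hypothetical toy data: group "a" contains one obvious outlier (100.0).
toy = pd.DataFrame({
    "k": ["a"] * 10 + ["b"] * 3,
    "v": [1.0, 2.0, 1.5, 2.5, 2.0, 1.0, 2.0, 1.5, 2.5, 100.0,
          10.0, 11.0, 12.0],
})

def is_inlier(group):
    # True where a value lies within 2 standard deviations of its group mean.
    return (group - group.mean()).abs() <= 2 * group.std()

# transform() keeps the original row index, so the result can be used as a row mask.
mask = toy.groupby("k")["v"].transform(is_inlier).astype(bool)

# Drop the outlier rows, then aggregate the remaining values per group.
cleaned = toy[mask]
stats = cleaned.groupby("k")["v"].agg(["mean", "std", "count"])
print(stats)
```

Here the statistics are requested by name ("mean", "std", "count") rather than by passing NumPy functions; "count" ignores missing values, so it matches len only when the column has no NaN.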

You can also compute the aggregate statistics inside the callback function:

def func(group): 
    mean, std = group.mean(), group.std() 
    inliers = (group - mean).abs() <= 2*std 
    tmp = group[inliers] 
    return {"mean":tmp.mean(), "std":tmp.std(), "len":tmp.shape[0]} 

g.c5.apply(func).unstack()
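
If you prefer not to depend on how apply() unpacks the returned dictionaries, the same table can be assembled explicitly by looping over the groups. This is only a sketch under assumptions: the frame toy, its columns k and v, and the helper summarize are hypothetical names chosen for illustration, and the threshold is the same 2 standard deviations used above.

```python
import pandas as pd

# Hypothetical toy data: group "a" contains one obvious outlier (100.0).
toy = pd.DataFrame({
    "k": ["a"] * 10 + ["b"] * 3,
    "v": [1.0, 2.0, 1.5, 2.5, 2.0, 1.0, 2.0, 1.5, 2.5, 100.0,
          10.0, 11.0, 12.0],
})

def summarize(group):
    # Keep values within 2 standard deviations of the group mean,
    # then compute the statistics on the kept values only.
    mean, std = group.mean(), group.std()
    kept = group[(group - mean).abs() <= 2 * std]
    return {"mean": kept.mean(), "std": kept.std(), "len": kept.shape[0]}

# Iterating over a groupby yields (key, values) pairs; collect one dict per group
# and let from_dict turn each group into one row of the result.
per_group = {key: summarize(values) for key, values in toy.groupby("k")["v"]}
summary = pd.DataFrame.from_dict(per_group, orient="index")
print(summary)
```

Looping in Python is slower than a vectorized mask on large data, so this form is mainly useful when the per-group logic is too irregular for transform() or agg().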