2016-12-02

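Pandas: how to select groups with a small standard deviation within the group?

I have a dataframe with roughly 100 rows per group ID. I want to group by group ID and then keep only the groups where the standard deviation of a column is below a threshold. I am using the following code: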

# df is the dataframe with all rows 
# group on groupID 
df_grouped = df.groupby('groupID') 

# this gives a table with groupID and the std within a group 
df_grouped_std = df_grouped.std() 

# from the df with standard deviations, I select only the groups 
# where the standard deviation is within limits 
selection = df_grouped_std[df_grouped_std['col1']<1][df_grouped_std['col2']<0.05] 

# now I try to select from the original dataframe 'df_grouped' the groups that were selected in the previous step. 
df_plot = df_grouped[selection] 

Stack trace:

Traceback (most recent call last): 

    File "<ipython-input-72-2cd045ecb262>", line 1, in <module> 
    runfile('C:/Documents and Settings/a708818/Desktop/coloredByRol.py', wdir='C:/Documents and Settings/a708818/Desktop') 

    File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 682, in runfile 
    execfile(filename, namespace) 

    File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile 
    exec(compile(scripttext, filename, 'exec'), glob, loc) 

    File "C:/Documents and Settings/a708818/Desktop/coloredByRol.py", line 50, in <module> 
    df_plot = df_grouped[selection] 

    File "C:\Anaconda\lib\site-packages\pandas\core\groupby.py", line 3170, in __getitem__ 
    if key not in self.obj: 

    File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 688, in __contains__ 
    return key in self._info_axis 

    File "C:\Anaconda\lib\site-packages\pandas\core\index.py", line 885, in __contains__ 
    hash(key) 

    File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 647, in __hash__ 
    ' hashed'.format(self.__class__.__name__)) 

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed

I can't figure out how to select the data I want. Any hints?

Answer


I think you can use:

df_grouped = df.groupby('groupID') 
#get std per groups 
df_grouped_std = df_grouped.std() 
print (df_grouped_std) 
#select by conditions 
selection = df_grouped_std[ (df_grouped_std['col1']<1) & (df_grouped_std['col2']<0.05)] 
print (selection) 

#select all rows of original df where groupID is same as index of 'selection' 
df_plot = df[df.groupID.isin(selection.index)] 
print (df_plot) 
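For context, the original `df_grouped[selection]` fails because indexing a GroupBy object selects columns by label, so pandas tries to look the DataFrame up as a key and cannot hash it. Here is a small sketch of the difference. The data is made up for illustration, and the exact exception type and message may vary between pandas versions:

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({'groupID': [1, 1, 2, 2],
                   'col1': [1.0, 1.1, 5.0, 9.0]})
df_grouped = df.groupby('groupID')
selection = df_grouped.std()

# Indexing a GroupBy with a column label works
col1_std = df_grouped['col1'].std()

# Indexing it with a DataFrame raises, as in the traceback above
failed = False
try:
    df_grouped[selection]
except Exception as exc:
    failed = True
    print(type(exc).__name__, exc)

# Filtering the original frame by the selected group labels works instead
kept = df[df['groupID'].isin(selection[selection['col1'] < 1].index)]
print(kept)
```

So the group labels have to be taken from the index of the std table and then matched against the `groupID` column of the original frame.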

Sample:

import pandas as pd

df = pd.DataFrame({'groupID':[1,1,1,2,3,3,2], 
        'col1':[5,3,6,4,7,8,9], 
        'col2':[7,8,9,1,2,3,8]}) 

print (df) 
   col1  col2  groupID
0     5     7        1
1     3     8        1
2     6     9        1
3     4     1        2
4     7     2        3
5     8     3        3
6     9     8        2

df_grouped = df.groupby('groupID') 
# 
df_grouped_std = df_grouped.std() 
print (df_grouped_std) 
             col1      col2
groupID                    
1        1.527525  1.000000
2        3.535534  4.949747
3        0.707107  0.707107

#change conditions for testing only 
selection = df_grouped_std[ (df_grouped_std['col1']>1) & (df_grouped_std['col2']>3)] 
print (selection) 
             col1      col2
groupID                    
2        3.535534  4.949747

# 
df_plot = df[df.groupID.isin(selection.index)] 
print (df_plot) 
   col1  col2  groupID
3     4     1        2
6     9     8        2
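If you need this in several places, the steps above could be wrapped in a small helper. This is only a sketch: the function name `keep_low_std_groups` and the `limits` dict are my own assumptions, not part of pandas, and the limits used here are chosen just so that the sample data returns something:

```python
import pandas as pd

def keep_low_std_groups(df, group_col, limits):
    """Return the rows of df whose group has a std below every limit.

    `limits` maps column name -> upper bound for the within-group std.
    (Hypothetical helper, for illustration only.)
    """
    # std of each limited column per group (index = group labels)
    stds = df.groupby(group_col)[list(limits)].std()
    # start with all groups accepted, then apply each limit in turn
    ok = pd.Series(True, index=stds.index)
    for col, limit in limits.items():
        ok &= stds[col] < limit
    # keep the original rows whose group label passed every check
    return df[df[group_col].isin(ok[ok].index)]

df = pd.DataFrame({'groupID': [1, 1, 1, 2, 3, 3, 2],
                   'col1': [5, 3, 6, 4, 7, 8, 9],
                   'col2': [7, 8, 9, 1, 2, 3, 8]})

result = keep_low_std_groups(df, 'groupID', {'col1': 1, 'col2': 1})
print(result)
```

With these limits only group 3 should remain, since it is the only group whose std is below 1 in both columns. Note that a group with a single row has a std of NaN, and the comparison then drops it.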

Edit:

Another possible solution is to use filter:

print (df.groupby('groupID') 
     .filter(lambda x: (x.col1.std() > 1) & (x.col2.std() > 3))) 

   col1  col2  groupID
3     4     1        2
6     9     8        2
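`filter` calls the lambda once per group, which may become slow with a very large number of groups. A possible variant, which is my own addition and uses the same test-only conditions as above, is to broadcast the per-group std back to every row with `transform` and build a boolean mask from it:

```python
import pandas as pd

df = pd.DataFrame({'groupID': [1, 1, 1, 2, 3, 3, 2],
                   'col1': [5, 3, 6, 4, 7, 8, 9],
                   'col2': [7, 8, 9, 1, 2, 3, 8]})

# per-group std repeated on every row of that group (same index as df)
std_per_row = df.groupby('groupID')[['col1', 'col2']].transform('std')

# same test-only conditions as in the filter example
mask = (std_per_row['col1'] > 1) & (std_per_row['col2'] > 3)
print(df[mask])
```

This should select the same two rows of group 2 as the `filter` version.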

The solution using filter looks cleaner. Thanks! – marqram