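How do I get the rows of a pandas DataFrame that hold the maximum value in a column, while keeping the original index?

I have a pandas DataFrame. Its first column can contain the same value several times (in other words, the values in the first column are not unique).

Whenever several rows contain the same value in the first column, I want to keep only the one with the maximum value in the third column. I almost found a solution: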

import pandas

ls = []
ls.append({'c1':'a', 'c2':'a', 'c3':1})
ls.append({'c1':'a', 'c2':'c', 'c3':3})
ls.append({'c1':'a', 'c2':'b', 'c3':2})
ls.append({'c1':'b', 'c2':'b', 'c3':10})
ls.append({'c1':'b', 'c2':'c', 'c3':12})
ls.append({'c1':'b', 'c2':'a', 'c3':7})

df = pandas.DataFrame(ls, columns=['c1','c2','c3'])
print(df)
print('--------------------')
# irow() was removed from later pandas releases; iloc is the positional equivalent
print(df.groupby('c1').apply(lambda g: g.iloc[g['c3'].argmax()]))

As a result I get:

  c1 c2  c3
0  a  a   1
1  a  c   3
2  a  b   2
3  b  b  10
4  b  c  12
5  b  a   7
--------------------
   c1 c2  c3
c1
a   a  c   3
b   b  c  12

My problem is that I don't want c1 as the index. What I want is the following:

  c1 c2  c3
1  a  c   3
4  b  c  12

Source

2013-12-20 Roman

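When you call df.groupby(...).apply(foo), the type of object returned by foo affects how the per-group results are combined.

If foo returns a Series, the Series's index becomes the columns of the final result, and the groupby key becomes the index (a bit mind-bending).

If foo returns a DataFrame, the final result uses the DataFrame's index as index values and the DataFrame's columns as columns (very sensible).

So you can arrange for the type of output you want by converting the Series to a DataFrame.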
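To make the difference concrete, here is a minimal sketch that runs the same selection twice on the question's data, once returning a Series and once returning a one-row DataFrame. The names grouped, by_series and by_frame are illustrative only, and the explicit column selection is my own assumption, added so the callback sees the same columns across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'c1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'c2': ['a', 'c', 'b', 'b', 'c', 'a'],
    'c3': [1, 3, 2, 10, 12, 7],
})

# Select the non-key columns explicitly so the callback sees the same
# columns regardless of how a pandas version treats the grouping column.
grouped = df.groupby('c1')[['c2', 'c3']]

# Callback returns a Series (one row): the group keys become the index.
by_series = grouped.apply(lambda g: g.loc[g['c3'].idxmax()])

# Callback returns a one-row DataFrame: the row's original index label
# survives as an inner index level underneath the group key.
by_frame = grouped.apply(lambda g: g.loc[[g['c3'].idxmax()]])

print(by_series)
print(by_frame)
```

In the second result the original labels are still present in the inner index level, which is why dropping the outer level with reset_index(level=0, drop=True) recovers them.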

With pandas 0.13 you can use the to_frame().T method:

def maxrow(x, col):
    # idxmax() returns the index label of the maximum, which is what .loc expects
    return x.loc[x[col].idxmax()].to_frame().T

result = df.groupby('c1').apply(maxrow, 'c3')
result = result.reset_index(level=0, drop=True)
print(result)

yields

  c1 c2  c3
1  a  c   3
4  b  c  12
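As an aside that is not part of the original answer: on reasonably recent pandas versions the same rows can probably be selected without apply at all, by asking the grouped column for the index label of each group's maximum and looking those labels up. A minimal sketch, assuming the question's data (best_labels is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'c1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'c2': ['a', 'c', 'b', 'b', 'c', 'a'],
    'c3': [1, 3, 2, 10, 12, 7],
})

# For each c1 group, idxmax() gives the index label of the row that
# holds the largest c3 value.
best_labels = df.groupby('c1')['c3'].idxmax()

# Looking those labels up with .loc keeps the original index and dtypes.
result = df.loc[best_labels]
print(result)
```

Because the rows are taken straight from df, no reset_index step is needed and the columns keep their original dtypes. If a group has several rows tied for the maximum, idxmax() returns the first of them.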

In pandas 0.12 or older, the equivalent would be:

import pandas as pd

def maxrow(x, col):
    ser = x.loc[x[col].idxmax()]
    # Build a one-row DataFrame whose index is the row's original label
    return pd.DataFrame({ser.name: ser}).T

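By the way, the sort-based solution (using_sort in the benchmark below) is faster than mine for small DataFrames. However, the sort raises the time complexity from O(n) to O(n log n), so it becomes slower than the to_frame solution shown above when applied to larger DataFrames.

Here is how I benchmarked it: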

import pandas as pd
import numpy as np
import timeit


def reset_df_first(df):
    # Move the original index into a column, then restore it afterwards
    df2 = df.reset_index()
    result = df2.groupby('c1').apply(lambda x: x.loc[x['c3'].idxmax()])
    result.set_index(['index'], inplace=True)
    return result

def maxrow(x, col):
    # idxmax() returns the index label of the maximum, as .loc expects
    result = x.loc[x[col].idxmax()].to_frame().T
    return result

def using_to_frame(df):
    result = df.groupby('c1').apply(maxrow, 'c3')
    result.reset_index(level=0, drop=True, inplace=True)
    return result

def using_sort(df):
    # DataFrame.sort was removed in later pandas; sort_values replaces it
    return df.sort_values('c3').groupby('c1', as_index=False).tail(1)


for N in (100, 1000, 2000):
    df = pd.DataFrame({'c1': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'},
                       'c2': {0: 'a', 1: 'c', 2: 'b', 3: 'b', 4: 'c', 5: 'a'},
                       'c3': {0: 1, 1: 3, 2: 2, 3: 10, 4: 12, 5: 7}})

    # Repeat the six rows N times and renumber the index
    df = pd.concat([df]*N)
    df.reset_index(inplace=True, drop=True)

    timing = dict()
    for func in (reset_df_first, using_to_frame, using_sort):
        timing[func] = timeit.timeit('m.{}(m.df)'.format(func.__name__),
                                     'import __main__ as m ',
                                     number=10)

    print('For N = {}'.format(N))
    for func in sorted(timing, key=timing.get):
        print('{:<20}: {:<0.3g}'.format(func.__name__, timing[func]))
    print()

yields

For N = 100
using_sort          : 0.018
using_to_frame      : 0.0265
reset_df_first      : 0.0303

For N = 1000
using_to_frame      : 0.0358  \
using_sort          : 0.036   / this is roughly where the two methods cross over in terms of performance
reset_df_first      : 0.0432

For N = 2000
using_to_frame      : 0.0457
reset_df_first      : 0.0523
using_sort          : 0.0569

(reset_df_first was another possibility I tried.)

Source

2013-12-20 12:54:12 unutbu

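This will work starting from [pandas 0.13](https://github.com/pydata/pandas/pull/5164); in older versions, Series has no 'to_frame' method. – alko

@alko: Thanks for the heads-up. I've added equivalent code that is compatible with version 0.12 or earlier. – unutbu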

Try this:

# DataFrame.sort was removed in later pandas; sort_values replaces it
df.sort_values('c3').groupby('c1', as_index=False).tail(1)
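For reference, a self-contained sketch of this sort-based approach on the question's data. It assumes a pandas version that provides sort_values; the mergesort option and the final sort_index() are my own additions, not part of the one-liner above.

```python
import pandas as pd

df = pd.DataFrame({
    'c1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'c2': ['a', 'c', 'b', 'b', 'c', 'a'],
    'c3': [1, 3, 2, 10, 12, 7],
})

# Sort by c3 ascending, then keep the last (largest) row of each c1 group.
# mergesort is a stable sort, so tied rows keep their original relative order.
result = df.sort_values('c3', kind='mergesort').groupby('c1').tail(1)

# tail() returns the selected rows with their original index labels;
# sort_index() puts them back in the original row order.
result = result.sort_index()
print(result)
```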

Source

2013-12-20 12:33:01

I can't bring myself to upvote code that violates PEP8; but to get the result the OP expects, you may need to add '.reset_index(level=0, drop=True)' – alko
