2017-02-22

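Filter a pandas DataFrame based on groupby (top 3 only)

I have a DataFrame with thousands of rows and 20 columns. The date is the index, and many dates are repeated. Example df: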

  Stock Sales Data 1  Data 2 
1/1/2012 Apple 120  0.996691907 0.376607328 
1/1/2012 Apple 230  0.084699221 0.56433743 
1/1/2012 Apple 340  0.141253424 0.319522467 
1/1/2012 Berry 230  0.506264018  0.123657902 
1/1/2012 Berry 340  0.646633737  0.635841995 
1/1/2012 Cat  1250 0.204030887 0.928827628 
1/1/2012 Cat  850  0.556935133 0.81033956 
1/1/2012 Cat  650  0.771751177 0.988848472 
1/1/2012 Cat  650  0.615222763 0.468555772 
1/2/2012 Apple 1065 0.504410742 0.402553442 
1/2/2012 Apple 200  0.752335341 0.487556857 
1/2/2012 BlackBerry 1465 0.693017964 0.925737402 
1/2/2012 BlackBerry 2000 0.262392424 0.076542936 
1/2/2012 BlackBerry 1465 0.851841806 0.345077839 
1/2/2012 BlackBerry 1465 0.70635569 0.718340524 
1/2/2012 Tomato 700  0.911297224 0.155699549 
1/2/2012 Tomato 235  0.118843588 0.662083069 
1/2/2012 Carrot 500 0.07255267 0.585773563 

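I want to filter the data so that, for each date and each stock, at most 3 rows are shown, and those 3 are the rows with the largest sales.

If a date and stock combination has only 1 or 2 rows, all of them are naturally kept.

If a date and stock group has 3 or more rows, I want only the 3 rows with the largest sales. If there is a tie for third place (identical sales figures), I still want a MAXIMUM of 3 rows for that date and stock, so whether by random choice or any other suitable method, the result should still contain exactly 3 rows for that stock on that date.

The example output could look like this: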

 Stock Sales Data 1  Data 2 
1/1/2012 Apple 120  0.996691907 0.376607328 
1/1/2012 Apple 230  0.084699221 0.56433743 
1/1/2012 Apple 340  0.141253424 0.319522467 
1/1/2012 Berry 230  0.506264018  0.123657902 
1/1/2012 Berry 340  0.646633737  0.635841995 
1/1/2012 Cat  1250 0.204030887 0.928827628 
1/1/2012 Cat  850  0.556935133 0.81033956 
1/1/2012 Cat  650  0.771751177 0.988848472 
1/2/2012 Apple 1065 0.504410742 0.402553442 
1/2/2012 Apple 200  0.752335341 0.487556857 
1/2/2012 BlackBerry 2000 0.262392424 0.076542936 
1/2/2012 BlackBerry 1465 0.851841806 0.345077839 
1/2/2012 BlackBerry 1465 0.70635569 0.718340524 
1/2/2012 Tomato 700  0.911297224 0.155699549 
1/2/2012 Tomato 235  0.118843588 0.662083069 
1/2/2012 Carrot 500 0.07255267 0.585773563 

Answers


You can simply combine groupby with nlargest to achieve this.

>>> data.groupby([data.index, data.Stock]).Sales.nlargest(3) 

      Stock    
1/1/2012 Apple  1/1/2012  340 
         1/1/2012  230 
         1/1/2012  120 
      Berry  1/1/2012  340 
         1/1/2012  230 
      Cat   1/1/2012 1250 
         1/1/2012  850 
         1/1/2012  650 
1/2/2012 Apple  1/2/2012 1065 
         1/2/2012  200 
      BlackBerry 1/2/2012 2000 
         1/2/2012 1465 
         1/2/2012 1465 
      Carrot  1/2/2012  500 
      Tomato  1/2/2012  700 
         1/2/2012  235 
Name: Sales, dtype: int64 
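On the tie question, here is a minimal sketch that uses the BlackBerry sales from 1/2/2012 as a standalone Series (the variable names are illustrative, not from the answer). With its default keep='first', nlargest returns exactly three values even when third place is tied, whereas keep='all' would return every tied row.

```python
import pandas as pd

# Sales of one (date, stock) group from the sample data: BlackBerry on 1/2/2012.
sales = pd.Series([1465, 2000, 1465, 1465])

# keep='first' is the default: ties are broken by order of appearance,
# so the result never has more than n rows.
top3 = sales.nlargest(3)

# keep='all' keeps every row tied with the last selected value,
# so the result can exceed n rows.
top3_with_ties = sales.nlargest(3, keep='all')

print(len(top3), len(top3_with_ties))
```

Since the question asks for a hard maximum of 3 rows per group, the default behaviour is the one that fits here.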

Of course, if you want the full subset of your DataFrame as output rather than only the Sales values, you can use iloc:

>>> data.iloc[data.reset_index().groupby(['index', 'Stock']) 
           .Sales.nlargest(3).index.levels[2]] 

       Stock Sales  Data1  Data2 
1/1/2012  Apple 120 0.996692 0.376607 
1/1/2012  Apple 230 0.084699 0.564337 
1/1/2012  Apple 340 0.141253 0.319522 
1/1/2012  Berry 230 0.506264 0.123658 
1/1/2012  Berry 340 0.646634 0.635842 
1/1/2012   Cat 1250 0.204031 0.928828 
1/1/2012   Cat 850 0.556935 0.810340 
1/1/2012   Cat 650 0.771751 0.988848 
1/2/2012  Apple 1065 0.504411 0.402553 
1/2/2012  Apple 200 0.752335 0.487557 
1/2/2012 BlackBerry 1465 0.693018 0.925737 
1/2/2012 BlackBerry 2000 0.262392 0.076543 
1/2/2012 BlackBerry 1465 0.851842 0.345078 
1/2/2012  Tomato 700 0.911297 0.155700 
1/2/2012  Tomato 235 0.118844 0.662083 
1/2/2012  Carrot 500 0.072553 0.585774 
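Note that index.levels holds the unique labels of a level rather than one label per row. As a hedged variant, the sketch below reads the per-row labels with get_level_values instead and sorts them to restore the original row order. The small frame and the names flat, top, rows and result are made up for illustration; the sketch assumes the date index is unnamed, so reset_index() produces a column called 'index'.

```python
import pandas as pd

# A small stand-in for the question's data: the date is a repeated index.
data = pd.DataFrame(
    {
        "Stock": ["Apple", "Apple", "Cat", "Cat", "Cat", "Cat"],
        "Sales": [120, 230, 1250, 850, 650, 650],
    },
    index=["1/1/2012"] * 6,
)

# Move the (unnamed) date index into a column named 'index'; the new
# RangeIndex labels now coincide with row positions in `data`.
flat = data.reset_index()

# Top 3 sales per (date, stock); the last index level carries the row labels.
top = flat.groupby(["index", "Stock"])["Sales"].nlargest(3)
rows = top.index.get_level_values(-1)

# Select the full rows by position, in their original order.
result = data.iloc[sorted(rows)]
print(result)
```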

Thank you for the kind words. – piRSquared


Using sort_values(), groupby() and head() appears to produce the result you are looking for.

import pandas as pd 

# Raw string so that \s is passed to the regex engine unchanged
df = pd.read_table('fruit', sep=r'\s+') 
df.Date = pd.to_datetime(df.Date) 

df.sort_values(by=['Date', 'Stock', 'Sales'], 
       ascending=[True, True, False], 
       inplace=True) 

#    Date  Stock Sales  Data1  Data2 
# 2 2012-01-01  Apple 340 0.141253 0.319522 
# 1 2012-01-01  Apple 230 0.084699 0.564337 
# 0 2012-01-01  Apple 120 0.996692 0.376607 
# 4 2012-01-01  Berry 340 0.646634 0.635842 
# 3 2012-01-01  Berry 230 0.506264 0.123658 
# 5 2012-01-01   Cat 1250 0.204031 0.928828 
# 6 2012-01-01   Cat 850 0.556935 0.810340 
# 7 2012-01-01   Cat 650 0.771751 0.988848 
# 8 2012-01-01   Cat 650 0.615223 0.468556 
# 9 2012-01-02  Apple 1065 0.504411 0.402553 
# 10 2012-01-02  Apple 200 0.752335 0.487557 
# 12 2012-01-02 BlackBerry 2000 0.262392 0.076543 
# 11 2012-01-02 BlackBerry 1465 0.693018 0.925737 
# 13 2012-01-02 BlackBerry 1465 0.851842 0.345078 
# 14 2012-01-02 BlackBerry 1465 0.706356 0.718341 
# 17 2012-01-02  Carrot 500 0.072553 0.585774 
# 15 2012-01-02  Tomato 700 0.911297 0.155700 
# 16 2012-01-02  Tomato 235 0.118844 0.662083 



# head(3) returns a new DataFrame, so assign the result before printing
df = df.groupby(by=['Date', 'Stock'], as_index=False, sort=False).head(3) 

print(df) 

#    Date  Stock Sales  Data1  Data2 
# 2 2012-01-01  Apple 340 0.141253 0.319522 
# 1 2012-01-01  Apple 230 0.084699 0.564337 
# 0 2012-01-01  Apple 120 0.996692 0.376607 
# 4 2012-01-01  Berry 340 0.646634 0.635842 
# 3 2012-01-01  Berry 230 0.506264 0.123658 
# 5 2012-01-01   Cat 1250 0.204031 0.928828 
# 6 2012-01-01   Cat 850 0.556935 0.810340 
# 7 2012-01-01   Cat 650 0.771751 0.988848 
# 9 2012-01-02  Apple 1065 0.504411 0.402553 
# 10 2012-01-02  Apple 200 0.752335 0.487557 
# 12 2012-01-02 BlackBerry 2000 0.262392 0.076543 
# 11 2012-01-02 BlackBerry 1465 0.693018 0.925737 
# 13 2012-01-02 BlackBerry 1465 0.851842 0.345078 
# 17 2012-01-02  Carrot 500 0.072553 0.585774 
# 15 2012-01-02  Tomato 700 0.911297 0.155700 
# 16 2012-01-02  Tomato 235 0.118844 0.662083
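If the original row order matters, as in the question's sample output, the same idea can be written as a non-mutating chain that ends with sort_index(). This is only a sketch on a made-up six-row frame with Date as an ordinary column and a default integer index; kind='mergesort' is chosen here so that ties are resolved by order of appearance, which is one of the "any suitable method" options the question allows.

```python
import pandas as pd

# Made-up subset of the sample data, with Date as a regular column.
df = pd.DataFrame(
    {
        "Date": ["2012-01-02"] * 6,
        "Stock": ["BlackBerry"] * 4 + ["Tomato"] * 2,
        "Sales": [1465, 2000, 1465, 1465, 700, 235],
    }
)

top3 = (
    df.sort_values("Sales", ascending=False, kind="mergesort")  # stable sort
      .groupby(["Date", "Stock"], sort=False)
      .head(3)        # first 3 rows of each group, i.e. the 3 largest sales
      .sort_index()   # restore the original row order
)
print(top3)
```

Because nothing is sorted in place, df itself is left untouched and top3 holds the filtered rows.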