2016-06-09 11 views

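pandas: group by two text columns and return the rows with the maximum count

I'm trying to find, for each First_Word, the (First_Word, Group) pair with the highest count.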

import pandas as pd 

df = pd.DataFrame({'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'], 
      'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'], 
      'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice', 
       'apple fell out of the tree', 'partrige in a pear tree']}, 
      columns=['First_Word', 'Group', 'Text']) 

   First_Word         Group                        Text
0       apple    apple bins     where to buy apple bins
1       apple   apple trees         i see an apple tree
2      orange  orange juice         i like orange juice
3       apple   apple trees  apple fell out of the tree
4        pear     pear tree     partrige in a pear tree

Then I did a groupby:

grouped = df.groupby(['First_Word', 'Group']).count() 
                         Text
First_Word Group
apple      apple bins       1
           apple trees      2
orange     orange juice     1
pear       pear tree        1
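
Now I want to filter this down to only the index rows that have the maximum Text count within each First_Word. In the output below, apple bins has been removed because apple trees has the larger count.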

                         Text
First_Word Group
apple      apple trees      2
orange     orange juice     1
pear       pear tree        1

The "max value of group" question is similar, but when I try something like this:

df.groupby(["First_Word", "Group"]).count().apply(lambda t: t[t['Text']==t['Text'].max()]) 

I get an error: KeyError: ('Text', 'occurred at index Text'). If I add axis=1 to the apply call, I get IndexError: ('index out of bounds', 'occurred at index (apple, apple bins)')
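
As a hedged aside on the KeyError: with the default axis=0, DataFrame.apply passes each column to the function as a Series. Inside the lambda, `t` is therefore the Text column itself, indexed by (First_Word, Group), and `t['Text']` looks for an index label named 'Text' that does not exist. The sketch below records what apply hands over; the helper name `record` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'],
     'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'],
     'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
              'apple fell out of the tree', 'partrige in a pear tree']},
    columns=['First_Word', 'Group', 'Text'])

grouped = df.groupby(['First_Word', 'Group']).count()

received = []  # hypothetical helper state: collects whatever apply passes in


def record(col):
    # With axis=0 (the default), `col` is one column of `grouped` as a Series.
    received.append(col)
    return col.max()


grouped.apply(record)

first = received[0]
print(type(first).__name__, first.name)  # the object is a Series named after the column
print('Text' in first.index)             # 'Text' is a column name, not an index label
```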

Answer


Given grouped, you now want to group by the First_Word index level and find the index label of the row with the maximum count in each group (using idxmax):

In [39]: grouped.groupby(level='First_Word')['Text'].idxmax() 
Out[39]: 
First_Word 
apple  (apple, apple trees) 
orange (orange, orange juice) 
pear   (pear, pear tree) 
Name: Text, dtype: object 

You can then use grouped.loc to select the rows of grouped by those index labels:

import pandas as pd 
df = pd.DataFrame(
    {'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'], 
    'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'], 
    'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice', 
       'apple fell out of the tree', 'partrige in a pear tree']}, 
    columns=['First_Word', 'Group', 'Text']) 

grouped = df.groupby(['First_Word', 'Group']).count() 
result = grouped.loc[grouped.groupby(level='First_Word')['Text'].idxmax()] 
print(result) 

which yields

                         Text
First_Word Group
apple      apple trees      2
orange     orange juice     1
pear       pear tree        1
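
One caveat: idxmax returns a single label per group, so if two Groups tie for the maximum count within a First_Word, only the first is kept. If ties should all be retained, a possible variant (a sketch, not part of the original answer) is to broadcast each group's maximum with transform('max') and filter with a boolean mask. The names `max_per_word` and `result_with_ties` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'],
     'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'],
     'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
              'apple fell out of the tree', 'partrige in a pear tree']},
    columns=['First_Word', 'Group', 'Text'])

grouped = df.groupby(['First_Word', 'Group']).count()

# Broadcast each First_Word's maximum count back onto every row of that group.
max_per_word = grouped.groupby(level='First_Word')['Text'].transform('max')

# Keep every row whose count equals its group's maximum (all tied rows survive).
result_with_ties = grouped[grouped['Text'] == max_per_word]
print(result_with_ties)
```

On the sample data there are no ties, so this selects the same three rows as the idxmax approach.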