2017-02-03

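Given the problems with groupby().nlargest() described here and here, I am trying to work around them. The goal is to slice the original DF using the result of a groupby().nlargest(x) operation.

Note: for simplicity I use nlargest(1), but it could be any number of selections.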

{'city1': {0: 'Chicago', 
    1: 'Chicago', 
    2: 'Chicago', 
    3: 'Chicago', 
    4: 'Miami', 
    5: 'Houston', 
    6: 'Austin'}, 
'city2': {0: 'Toronto', 
    1: 'Detroit', 
    2: 'St.Louis', 
    3: 'Miami', 
    4: 'Dallas', 
    5: 'Dallas', 
    6: 'Dallas'}, 
'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0}, 
'plant1_type': {0: 'COMBCYCL', 
    1: 'COMBCYCL', 
    2: 'NUKE', 
    3: 'COAL', 
    4: 'NUKE', 
    5: 'COMBCYCL', 
    6: 'COAL'}, 
'plant2_type': {0: 'COAL', 
    1: 'COAL', 
    2: 'COMBCYCL', 
    3: 'COMBCYCL', 
    4: 'COAL', 
    5: 'NUKE', 
    6: 'NUKE',}} 
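For readers who want to reproduce the examples, here is a minimal sketch of loading the data above into a DataFrame. It assumes pandas is installed and that the frame is named `df`, as in the code that follows; the column-by-column layout is only a more compact way of writing the same values.

```python
import pandas as pd

# Same values as the dictionary above, written column by column.
# The default RangeIndex (0..6) matches the integer keys of the dictionary.
df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})

print(df)
```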

A) Group by city1 and return the selected rows from the original DF

cols2 = ['city1','plant1_type','plant2_type'] 
df.loc[df2.groupby(cols2)['p234_r_c'].nlargest(1).reset_index().level_3] 

     city1     city2  p234_r_c plant1_type plant2_type
6   Austin    Dallas       3.0        COAL        NUKE
3  Chicago     Miami       0.5        COAL    COMBCYCL
0  Chicago   Toronto       5.0    COMBCYCL        COAL
2  Chicago  St.Louis       2.0        NUKE    COMBCYCL
5  Houston    Dallas       4.0    COMBCYCL        NUKE
4    Miami    Dallas       1.0        NUKE        COAL

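The above looks fine.

B) Group by city2 and return the selected rows from the original DF

Because the same code used in #A produces bogus results when grouping by city2, the following workaround was suggested: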

cols = ['city2','plant1_type','plant2_type'] 
df.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1) 


city2     plant1_type  plant2_type
Toronto   COMBCYCL     COAL           5.0
Detroit   COMBCYCL     COAL           4.0
St.Louis  NUKE         COMBCYCL       2.0
Miami     COAL         COMBCYCL       0.5
Dallas    NUKE         COAL           1.0
          COMBCYCL     NUKE           4.0
          COAL         NUKE           3.0

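Now, how do I use this result to return the selected rows from the original DF, as I did in #A?

Note: if the original DF had an additional row such that, for city2, at least one group in the groupby.nlargest() result had a size greater than 1, then the code in #A could be used for #B as well.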

Answer


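Unless I'm missing something (and I agree there are bugs lurking in the pandas code here), we can bypass any difficulties relatively simply.

Method 1: use loc and idxmax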

In [21]: df.loc[df.groupby(cols2)["p234_r_c"].idxmax()] 
Out[21]: 
     city1     city2  p234_r_c plant1_type plant2_type
6   Austin    Dallas       3.0        COAL        NUKE
3  Chicago     Miami       0.5        COAL    COMBCYCL
0  Chicago   Toronto       5.0    COMBCYCL        COAL
2  Chicago  St.Louis       2.0        NUKE    COMBCYCL
5  Houston    Dallas       4.0    COMBCYCL        NUKE
4    Miami    Dallas       1.0        NUKE        COAL

In [22]: df.loc[df.groupby(cols)["p234_r_c"].idxmax()] 
Out[22]: 
     city1     city2  p234_r_c plant1_type plant2_type
6   Austin    Dallas       3.0        COAL        NUKE
5  Houston    Dallas       4.0    COMBCYCL        NUKE
4    Miami    Dallas       1.0        NUKE        COAL
1  Chicago   Detroit       4.0    COMBCYCL        COAL
3  Chicago     Miami       0.5        COAL    COMBCYCL
2  Chicago  St.Louis       2.0        NUKE    COMBCYCL
0  Chicago   Toronto       5.0    COMBCYCL        COAL
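As a self-contained sketch of Method 1 (assuming the question's data and the names `df` and `cols2`; the intermediate names `best_labels` and `best_rows` are only illustrative): `idxmax` returns, for each group, the index label of the row with the largest value, and `loc` then pulls those rows out of the original frame unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})

cols2 = ["city1", "plant1_type", "plant2_type"]

# One index label per group: the row holding that group's maximum p234_r_c.
best_labels = df.groupby(cols2)["p234_r_c"].idxmax()

# Select those labels from the original frame, keeping all original columns.
best_rows = df.loc[best_labels]

print(best_rows)
```

Row 1 (Chicago, Detroit, 4.0) drops out because it shares a group with row 0 (Chicago, Toronto, 5.0), which has the larger value.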

Method 2: sort by p234_r_c and use last

In [17]: df.sort_values("p234_r_c").groupby(cols2, as_index=False).last() 
Out[17]: 
     city1 plant1_type plant2_type     city2  p234_r_c
0   Austin        COAL        NUKE    Dallas       3.0
1  Chicago        COAL    COMBCYCL     Miami       0.5
2  Chicago    COMBCYCL        COAL   Toronto       5.0
3  Chicago        NUKE    COMBCYCL  St.Louis       2.0
4  Houston    COMBCYCL        NUKE    Dallas       4.0
5    Miami        NUKE        COAL    Dallas       1.0

In [18]: df.sort_values("p234_r_c").groupby(cols, as_index=False).last() 
Out[18]: 
      city2 plant1_type plant2_type    city1  p234_r_c
0    Dallas        COAL        NUKE   Austin       3.0
1    Dallas    COMBCYCL        NUKE  Houston       4.0
2    Dallas        NUKE        COAL    Miami       1.0
3   Detroit    COMBCYCL        COAL  Chicago       4.0
4     Miami        COAL    COMBCYCL  Chicago       0.5
5  St.Louis        NUKE    COMBCYCL  Chicago       2.0
6   Toronto    COMBCYCL        COAL  Chicago       5.0
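One thing worth noticing in the Method 2 output is that `last()` builds a new frame with a fresh 0..n-1 index, so the original row labels are gone. A rough sketch of the difference, assuming the question's data (the names `ordered`, `by_last` and `by_tail` are illustrative): `tail(1)` on the same sorted groupby picks the same winners but returns them as rows of the original frame. As far as I know, `last()` also skips nulls column by column by default, so with missing data it may combine values from different rows, whereas `tail(1)` always returns whole rows.

```python
import pandas as pd

df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})

cols2 = ["city1", "plant1_type", "plant2_type"]

# Sort ascending so the largest value in each group comes last.
ordered = df.sort_values("p234_r_c")

# last(): one row per group, group keys first, new 0..n-1 index.
by_last = ordered.groupby(cols2, as_index=False).last()

# tail(1): the same per-group winners, but with the original index labels.
by_tail = ordered.groupby(cols2).tail(1)

print(by_last)
print(by_tail)
```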

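If you want to be able to get multiple rows per group as well, then while nlargest and nsmallest are both broken, I think the simplest way is to sort and then use head or tail. For example: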

In [27]: df.sort_values("p234_r_c").groupby(cols, as_index=False).tail(2) 
Out[27]: 
     city1     city2  p234_r_c plant1_type plant2_type
3  Chicago     Miami       0.5        COAL    COMBCYCL
4    Miami    Dallas       1.0        NUKE        COAL
2  Chicago  St.Louis       2.0        NUKE    COMBCYCL
6   Austin    Dallas       3.0        COAL        NUKE
1  Chicago   Detroit       4.0    COMBCYCL        COAL
5  Houston    Dallas       4.0    COMBCYCL        NUKE
0  Chicago   Toronto       5.0    COMBCYCL        COAL
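The sort-then-head idea can be wrapped in a small helper. This is only a sketch: `top_n_per_group` is a hypothetical name, not a pandas function, and it sorts descending and uses `head(n)`, which selects the same rows as sorting ascending and using `tail(n)`. The `mergesort` kind is an optional choice to keep tied rows in their original relative order.

```python
import pandas as pd

def top_n_per_group(frame, keys, value, n):
    """Return the n rows with the largest `value` in each group (hypothetical helper)."""
    # Stable descending sort, then keep the first n rows of every group.
    ordered = frame.sort_values(value, ascending=False, kind="mergesort")
    return ordered.groupby(keys).head(n)

df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})

cols = ["city2", "plant1_type", "plant2_type"]
cols2 = ["city1", "plant1_type", "plant2_type"]

top2_by_city2 = top_n_per_group(df, cols, "p234_r_c", 2)
top1_by_city1 = top_n_per_group(df, cols2, "p234_r_c", 1)

print(top2_by_city2)
print(top1_by_city1)
```

With `cols`, every group in this data contains a single row, which is why asking for the top 2 returns all seven rows, just as in Out[27].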

Say I use 'Method #1', group by only 'cols = ['city1']', and want the largest 2 (or N) 'p234_r_c' values. I tried the following with 'N = 2' and got the same result as with 'N = 1': 'df.loc[df.groupby(cols2)["p234_r_c"].idxmax(2)]' – codingknob


For 'N = 2' we should have 2 rows for Chicago, i.e. the following row is missing: 'Chicago	Detroit	4.0	COMBCYCL	COAL' – codingknob


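@codingknob: 'idxmax' doesn't have an 'n' parameter, so if there is documentation somewhere implying that it does, please file a bug, because we need to fix it. :-( – DSM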
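For the case raised in the comments (group by city1 only, keep the largest N = 2 values per city), a possible sketch following the answer's sort-then-tail suggestion is shown below. It assumes the question's data; the names `n` and `top2` are illustrative. Since `idxmax` yields a single label per group, it is not used here.

```python
import pandas as pd

df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})

n = 2

# Sort ascending, then keep the last n rows of each city1 group.
top2 = df.sort_values("p234_r_c").groupby("city1").tail(n)

# Optional: order the result by city and descending value for readability.
top2 = top2.sort_values(["city1", "p234_r_c"], ascending=[True, False])

print(top2)
```

Chicago now contributes two rows, Toronto (5.0) and Detroit (4.0), while the cities with a single row contribute that row only.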