2016-07-29 100 views
1

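How to remove duplicate values in a dataset: python

I want to remove duplicates from my dataset by keeping, for each group, the row with the highest value. I am currently doing this with pandas: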

# For each (Hospital_ID, District_ID) group, keep the row with the most employees.
c_maxes = (hospProfiling.groupby(['Hospital_ID', 'District_ID'], group_keys=False)
           .apply(lambda x: x.loc[x['Hospital_employees'].idxmax()]))
print(c_maxes)

c_maxes.to_csv('data/external/HospitalProfilingMaxes.csv') 

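Doing this turns the columns of the initial dataset, Hospital_ID, District_ID, Hospital_employees, into Hospital_ID, District_ID, Hospital_ID, District_ID, Hospital_employees.

The columns used for grouping are being duplicated. What is wrong here?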

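Edit:

When I use the groupby() function, an extra column is added at the beginning of the data. The column has no name; it is just a sequence number for each row. It is visible in the output shown in the second answer. I want to remove this extra column because I don't need it. I tried this: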

hospProfiling.drop(hospProfiling.columns[0], axis=1)

This code does not remove the column. How can it be removed?

Answers

3

Why not use the groupby max method?

hospProfiling.groupby(['Hospital_ID', 'District_ID'], as_index=False).max()

If you happen to have more than three columns, replace max with agg:

hospProfiling.groupby(['Hospital_ID', 'District_ID'], as_index=False).agg({'Hospital_employees': 'max'})

1

I think you need:

hospProfiling.loc[hospProfiling.groupby(['Hospital_ID', 'District_ID'])['Hospital_employees'] 
           .idxmax()] 

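I was quite surprised by the other answer, so I did some research into whether the idxmax function is useless or not.

Sample:

import pandas as pd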
hospProfiling = pd.DataFrame({'Hospital_ID': {0: 'A', 1: 'A', 2: 'B', 3: 'A', 4: 'A', 5: 'B', 6: 'A', 7: 'A', 8: 'B', 9: 'B', 10: 'A', 11: 'B', 12: 'A'}, 'Name': {0: 'Sam', 1: 'Annie', 2: 'Fred', 3: 'Sam', 4: 'Annie', 5: 'Fred', 6: 'Sam', 7: 'Annie', 8: 'Fred', 9: 'James', 10: 'Alan', 11: 'Julie', 12: 'Greg'}, 'District_ID': {0: 'M', 1: 'F', 2: 'M', 3: 'M', 4: 'F', 5: 'M', 6: 'M', 7: 'F', 8: 'M', 9: 'M', 10: 'M', 11: 'F', 12: 'M'}, 'Hospital_employees': {0: 25, 1: 41, 2: 70, 3: 44, 4: 12, 5: 14, 6: 20, 7: 10, 8: 30, 9: 18, 10: 56, 11: 28, 12: 33}, 'Val': {0: 100, 1: 7, 2: 14, 3: 200, 4: 5, 5: 20, 6: 1, 7: 0, 8: 7, 9: 9, 10: 6, 11: 9, 12: 47}}) 
hospProfiling = hospProfiling[['Hospital_ID','District_ID','Hospital_employees','Val','Name']] 
hospProfiling.sort_values(by=['Hospital_ID','District_ID'], inplace=True) 
print (hospProfiling) 
   Hospital_ID District_ID  Hospital_employees  Val   Name
1            A           F                  41    7  Annie
4            A           F                  12    5  Annie
7            A           F                  10    0  Annie
0            A           M                  25  100    Sam
3            A           M                  44  200    Sam
6            A           M                  20    1    Sam
10           A           M                  56    6   Alan
12           A           M                  33   47   Greg
11           B           F                  28    9  Julie
2            B           M                  70   14   Fred
5            B           M                  14   20   Fred
8            B           M                  30    7   Fred
9            B           M                  18    9  James

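The main difference is how the other columns are handled. If you use max, it returns the maximum of each column independently, here Hospital_employees and Val, so a single output row can combine values that come from different input rows: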

c_maxes = hospProfiling.groupby(['Hospital_ID','District_ID'],as_index = False).max() 
print (c_maxes) 
  Hospital_ID District_ID  Hospital_employees   Name  Val
0           A           F                  41  Annie    7
1           A           M                  56    Sam  200
2           B           F                  28  Julie    9
3           B           M                  70  James   20
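
To check that the column-wise max really mixes rows, here is a minimal sketch on a trimmed-down copy of the (A, M) group from the sample. The frame name `df` and the merge-based check are only for illustration, not part of the original answer. It tests whether the aggregated row exists anywhere in the original data:

```python
import pandas as pd

# Illustration data: three of the (A, M) rows from the sample above.
df = pd.DataFrame({
    'Hospital_ID': ['A', 'A', 'A'],
    'District_ID': ['M', 'M', 'M'],
    'Hospital_employees': [44, 56, 33],
    'Val': [200, 6, 47],
    'Name': ['Sam', 'Alan', 'Greg'],
})

# Column-wise maximum per group: every column is maximised independently.
col_max = df.groupby(['Hospital_ID', 'District_ID'], as_index=False).max()

# An inner merge on all shared columns keeps only aggregated rows
# that also appear verbatim in the original data.
real_rows = col_max.merge(df, how='inner')

print(col_max)
print(len(real_rows))  # 0: the aggregated row is not one of the original rows
```

The aggregated row holds 56 employees (Alan's row), Val 200 and Name Sam (both from Sam's row), which is why the merge finds no match.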

# Parentheses allow the method chain to continue on the next line.
c_maxes = (hospProfiling.groupby(['Hospital_ID', 'District_ID'], as_index=False)
           .agg({'Hospital_employees': 'max'}))
print (c_maxes) 
  Hospital_ID District_ID  Hospital_employees
0           A           F                  41
1           A           M                  56
2           B           F                  28
3           B           M                  70

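The idxmax function instead returns, for each group, the index label of the row holding the maximum of the Hospital_employees column: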

print (hospProfiling.groupby(['Hospital_ID', 'District_ID'])['Hospital_employees'].idxmax()) 
A  F     1
   M    10
B  F    11
   M     2
Name: Hospital_employees, dtype: int64
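
One detail that matters for deduplication: when several rows in a group share the maximum, idxmax returns the label of the first one, so exactly one row per group is selected. A small sketch with made-up tied values (the data and the name `df` are assumptions for illustration):

```python
import pandas as pd

# Two rows tie for the maximum (56); their index labels are 10 and 12.
df = pd.DataFrame({
    'Hospital_ID': ['A', 'A', 'A'],
    'District_ID': ['M', 'M', 'M'],
    'Hospital_employees': [56, 56, 20],
}, index=[10, 12, 6])

idx = df.groupby(['Hospital_ID', 'District_ID'])['Hospital_employees'].idxmax()
print(idx.tolist())  # [10]: the first row with the maximum wins
```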

Then you just select those rows from the DataFrame with loc:

c_maxes = hospProfiling.loc[hospProfiling.groupby(['Hospital_ID', 'District_ID'])['Hospital_employees'] 
         .idxmax()] 
print (c_maxes) 
   District_ID Hospital_ID  Hospital_employees   Name  Val
1            F           A                  41  Annie    7
10           M           A                  56   Alan    6
11           F           B                  28  Julie    9
2            M           B                  70   Fred   14
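
On the edit in the question: the unnamed leading column is most likely the DataFrame index rather than a real column, so `drop(hospProfiling.columns[0], axis=1)` cannot remove it. Note also that `drop` returns a new frame, so the result has to be assigned to have any effect. A minimal sketch, assuming small illustrative data and an in-memory buffer in place of the CSV path from the question, is to leave the index out when writing:

```python
import io
import pandas as pd

# Illustration data with duplicated (Hospital_ID, District_ID) pairs.
df = pd.DataFrame({
    'Hospital_ID': ['A', 'A', 'B', 'B'],
    'District_ID': ['F', 'F', 'M', 'M'],
    'Hospital_employees': [41, 12, 70, 14],
})

# Keep the row with the most employees in each group.
idx = df.groupby(['Hospital_ID', 'District_ID'])['Hospital_employees'].idxmax()
c_maxes = df.loc[idx]

# Optional: renumber the rows 0..n-1 instead of keeping the old labels.
c_maxes = c_maxes.reset_index(drop=True)

# index=False leaves the index out of the CSV output.
buf = io.StringIO()  # stands in for a file path
c_maxes.to_csv(buf, index=False)
csv_text = buf.getvalue()
print(csv_text)
```

With a real file, the same `index=False` argument can be passed to `to_csv` together with the path.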