2016-12-16 83 views
3

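Get pandas duplicate row counts with the original index

I need to find duplicate rows in a pandas DataFrame and then add an extra column holding the count. Say we have this DataFrame: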

>>> print(df)

+----+-----+-----+-----+-----+-----+-----+-----+-----+ 
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 
|----+-----+-----+-----+-----+-----+-----+-----+-----| 
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 | 
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 | 
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 9 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 | 
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 12 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 14 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 | 
| 18 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
+----+-----+-----+-----+-----+-----+-----+-----+-----+ 

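The frame above should then become the following, with an additional column holding the counts. Note that the original index labels are still preserved.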

+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ 
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 
|----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 | 1 | 
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 2 | 
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 | 1 | 
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 | 1 | 
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 | 1 | 
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ 

I have seen other solutions, such as:

df.groupby(list(df.columns.values)).size() 

However, that returns a result that is displayed with gaps and does not keep the original index.
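To make the problem concrete, here is a minimal sketch on a small made-up frame (the three-row `df` below is an assumption for illustration, not the data from the question). It shows that the result of `groupby(...).size()` is indexed by the grouped column values, so the original row labels are no longer available.

```python
import pandas as pd

# Hypothetical small frame: rows 0 and 2 are identical.
df = pd.DataFrame({"2": [0, 2, 0], "3": [0, 4, 0]})

counts = df.groupby(list(df.columns)).size()

# The result is a Series whose index is built from the column values,
# so the original row labels 0, 1, 2 are gone.
print(type(counts.index).__name__)   # MultiIndex
print(list(counts.index.names))      # ['2', '3']
print(counts.tolist())               # [2, 1]
```

The "gaps" in the printed output most likely come from pandas hiding repeated outer index levels when it displays a MultiIndex.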

Answer

4

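You can use reset_index first to convert the index into a column, and then aggregate that column with first and size (the number of rows in each group).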

Also, if you need to group by all of the columns, exclude the index column from the list with difference:

print (df.columns.difference(['index'])) 
Index(['2', '3', '4', '5', '6', '7', '8', '9'], dtype='object') 
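As a small illustration of what `difference` does here, the sketch below uses a made-up two-column frame (an assumption, not the question's data). After `reset_index()` the old index appears as a column named `index`, and `difference` returns every other label.

```python
import pandas as pd

# Hypothetical frame with string column labels, as in the question.
df = pd.DataFrame({"2": [0, 2], "3": [0, 4]})

# reset_index() moves the old index into a column named 'index'.
with_index = df.reset_index()
print(with_index.columns.tolist())   # ['index', '2', '3']

# difference() returns the labels that are not in the given list.
group_cols = with_index.columns.difference(["index"]).tolist()
print(group_cols)                    # ['2', '3']
```

Note that `difference` sorts its result by default, so with string labels such as '10' and '2' the order can differ from the original column order. Passing `sort=False` should keep the original order if that matters.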

print (df.reset_index() 
     .groupby(df.columns.difference(['index']).tolist())['index'] 
     .agg(['first', 'size']) 
     .reset_index() 
     .set_index(['first']) 
     .sort_index() 
     .rename_axis(None)) 

    2 3 4 5 6 7 8 9 size 
0 0 0 0 0 0 0 0 0  2 
1 2 0 0 0 0 0 0 0  2 
2 2 4 3 4 1 1 4 4  1 
3 4 3 4 0 0 0 0 0  2 
4 2 3 4 3 4 0 0 0  1 
5 5 0 0 0 0 0 0 0  3 
6 4 5 0 0 0 0 0 0  1 
7 1 1 4 0 0 0 0 0  1 
10 3 3 4 3 5 5 5 0  1 
11 5 4 0 0 0 0 0 0  1 
13 0 4 0 0 0 0 0 0  1 
15 1 3 5 0 0 0 0 0  1 
16 4 0 0 0 0 0 0 0  1 
17 3 3 4 4 0 0 0 0  1 
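For anyone who wants to try the chain without the full table, here is a self-contained sketch on a small made-up frame (the four-row `df` and the name `value_cols` are assumptions for illustration). It follows the same steps as above: move the index into a column, group by the value columns, and keep the first index label plus the group size.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 2 are duplicates of each other.
df = pd.DataFrame({"2": [0, 2, 0, 5], "3": [0, 4, 0, 0]})

value_cols = df.columns.tolist()

result = (
    df.reset_index()                 # old index becomes the 'index' column
      .groupby(value_cols)["index"]  # one group per distinct row
      .agg(["first", "size"])        # first index label and row count
      .reset_index()                 # grouped values back to columns
      .set_index("first")            # restore the original index labels
      .sort_index()
      .rename_axis(None)             # drop the 'first' index name
)

print(result.index.tolist())         # [0, 1, 3]
print(result["size"].tolist())       # [2, 1, 1]
```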

If the new column needs to be named as the next column, 10, use rename:

#if necessary convert to str 
last_col = str(df.columns.astype(int).max() + 1) 
print (last_col) 
10 

print (df.reset_index() 
     .groupby(df.columns.difference(['index']).tolist())['index'] 
     .agg(['first', 'size']) 
     .reset_index() 
     .set_index(['first']) 
     .sort_index() 
     .rename_axis(None) 
     .rename(columns={'size':last_col})) 

    2 3 4 5 6 7 8 9 10 
0 0 0 0 0 0 0 0 0 2 
1 2 0 0 0 0 0 0 0 2 
2 2 4 3 4 1 1 4 4 1 
3 4 3 4 0 0 0 0 0 2 
4 2 3 4 3 4 0 0 0 1 
5 5 0 0 0 0 0 0 0 3 
6 4 5 0 0 0 0 0 0 1 
7 1 1 4 0 0 0 0 0 1 
10 3 3 4 3 5 5 5 0 1 
11 5 4 0 0 0 0 0 0 1 
13 0 4 0 0 0 0 0 0 1 
15 1 3 5 0 0 0 0 0 1 
16 4 0 0 0 0 0 0 0 1 
17 3 3 4 4 0 0 0 0 1 
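An alternative that may be worth trying is to count with `transform` and then drop the duplicates, which keeps the original index without a `reset_index` round trip. This is only a sketch on a made-up frame, not the approach from the answer above; the column name `count` is a placeholder, and it assumes the frame has no missing values (by default `groupby` drops groups whose keys contain NaN).

```python
import pandas as pd

# Hypothetical frame: rows 0 and 2 are duplicates of each other.
df = pd.DataFrame({"2": [0, 2, 0, 5], "3": [0, 4, 0, 0]})
value_cols = df.columns.tolist()

# Size of each group of identical rows, broadcast back to every row,
# so the Series keeps the original index of df.
counts = df.groupby(value_cols)[value_cols[0]].transform("size")

# Keep the first occurrence of each distinct row; assign() aligns
# the counts on the index labels that remain.
result = df.drop_duplicates().assign(count=counts)

print(result.index.tolist())         # [0, 1, 3]
print(result["count"].tolist())      # [2, 1, 1]
```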
+0

Thank you, that worked very well. – kPow989

+0

Glad I could help! – jezrael