How to remove specific duplicate rows in a pandas DataFrame?

In this pandas DataFrame:

df = 

pos index data 
21  36 a,b,c 
21  36 a,b,c 
23  36 c,d,e 
25  36 f,g,h 
27  36 g,h,k 
29  39 a,b,c 
29  39 a,b,c 
31  39 . 
35  39 c,k 
36  41 g,h 
38  41 k,l 
39  41 j,k 
39  41 j,k

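I want to drop duplicate rows only within the same index group, and only when they occur in the head region of the subframe.

So, here is what I did: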

df_grouped = df.groupby(['index'], as_index=True)

Now,

for i, sub_frame in df_grouped:
    # pseudocode: remove one duplicate line in the head region if the pos value is a repeat
    sub_frame.apply(lambda g: ...)

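I want to use this approach because some pos values are repeated in the tail region, and those should not be removed.

Any suggestions?

Expected output: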

pos index data 
removed 
21  36 a,b,c 
23  36 c,d,e 
25  36 f,g,h 
27  36 g,h,k 
removed 
29  39 a,b,c 
31  39 . 
35  39 c,k 
36  41 g,h 
38  41 k,l 
39  41 j,k 
39  41 j,k

Source

2017-03-20 everestial007

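What about `df.drop_duplicates()`, as in http://stackoverflow.com/questions/23667369/drop-all-duplicate-row-in-python-pandas? – Craig

A simple drop function would work, but I only want to drop the duplicate when it sits in the head region of the 'subframe' (grouped by the index value). That is the main problem. – everestial007

@Craig: I just looked at that example and it does not work here. After the groupby, I have to address rows within each 'subframe' (though there may be another way). Also, only one duplicate needs to be dropped, and only when it is in the head region (top two rows) of the subframe. – everestial007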

If it does not have to be done in a single apply statement, then this code removes duplicates only in the head region:

import pandas as pd

data = {'pos': [21, 21, 23, 25, 27, 29, 29, 31, 35, 36, 38, 39, 39],
        'idx': [36, 36, 36, 36, 36, 39, 39, 39, 39, 41, 41, 41, 41],
        'data': ['a,b,c', 'a,b,c', 'c,d,e', 'f,g,h', 'g,h,k', 'a,b,c', 'a,b,c', '.', 'c,k', 'g,h', 'k,l', 'j,k', 'j,k']
}

df = pd.DataFrame(data)

# For each idx group, de-duplicate only the first two rows (the head region)
# and keep the remaining rows untouched.
accum = []
for i, sub_frame in df.groupby('idx'):
    accum.append(pd.concat([sub_frame.iloc[:2].drop_duplicates(), sub_frame.iloc[2:]]))

df2 = pd.concat(accum)

print(df2)
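The same result can probably be reached without an explicit loop. The following is only a sketch of an alternative, not the approach used above: it assumes the head region is always the first two rows of each `idx` group and that a "duplicate" means a row identical in every column. The names `rank_in_group`, `in_head` and `is_repeat` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'pos': [21, 21, 23, 25, 27, 29, 29, 31, 35, 36, 38, 39, 39],
    'idx': [36, 36, 36, 36, 36, 39, 39, 39, 39, 41, 41, 41, 41],
    'data': ['a,b,c', 'a,b,c', 'c,d,e', 'f,g,h', 'g,h,k', 'a,b,c', 'a,b,c',
             '.', 'c,k', 'g,h', 'k,l', 'j,k', 'j,k'],
})

# Position of each row inside its idx group (0, 1, 2, ...).
rank_in_group = df.groupby('idx').cumcount()

# Assumption: the head region is the first two rows of each group.
in_head = rank_in_group < 2

# True for a row that exactly repeats an earlier row (all columns compared).
is_repeat = df.duplicated()

# Drop a row only when it is both in the head region and a repeat.
df2 = df[~(in_head & is_repeat)]

print(df2)
```

Because `idx` is one of the compared columns, a repeat in the head region can only match the first row of its own group, so the duplicated `39 41 j,k` rows in the tail are left alone.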

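EDIT2: The first version of the chained command I posted was wrong and only worked for the sample data. This version provides a more general solution that removes duplicate rows as the OP requested: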

df.drop(df.groupby('idx')            # group by the index column
          .head(2)                   # select the first two rows of each group
          .duplicated()              # create a Series with True for duplicate rows
          .to_frame(name='duped')    # make the Series a dataframe
          .query('duped')            # select only the duplicate rows
          .index)                    # provide index of duplicated rows to drop
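To make the chain easier to follow, here is a sketch that splits it into named steps and runs it on the sample data from the question. The intermediate names `head_rows`, `dupes` and `result` are not part of the original command; they are introduced here only for illustration. Note that `df.drop` returns a new DataFrame by default, so the result has to be assigned to keep it.

```python
import pandas as pd

df = pd.DataFrame({
    'pos': [21, 21, 23, 25, 27, 29, 29, 31, 35, 36, 38, 39, 39],
    'idx': [36, 36, 36, 36, 36, 39, 39, 39, 39, 41, 41, 41, 41],
    'data': ['a,b,c', 'a,b,c', 'c,d,e', 'f,g,h', 'g,h,k', 'a,b,c', 'a,b,c',
             '.', 'c,k', 'g,h', 'k,l', 'j,k', 'j,k'],
})

# First two rows of every idx group, with their original row labels.
head_rows = df.groupby('idx').head(2)

# Rows in the head region that repeat an earlier head row.
dupes = head_rows[head_rows.duplicated()]

# Drop those row labels from the full frame; df itself is not modified.
result = df.drop(dupes.index)

print(result)
```

Only rows from the head region are ever inspected, which is why the repeated rows at the end of the last group survive.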

Source

2017-03-20 02:15:30 Craig
