2017-05-03

Remove duplicates from a pandas dataframe if the duplicate values are in the next row

I have a large dataframe whose columns look like this:

term_x     Intersections  term_y

boxers         1          briefs
briefs         1          boxers
babies         6          costumes
costumes       6          babies
babies        12          clothes
clothes       12          babies
babies         1          clothings
clothings      1          babies

This file has well over a million rows, and I want to cut out these redundant rows. Is there a way to use pandas' deduplication features to eliminate these copies quickly and Pythonically? My current approach iterates over the whole dataframe, takes the value of the next row, and then drops the duplicated lines, but this has proven to be very slow:

row_iterator = duplicate_df_selfmerge.iterrows() 
_, prev = next(row_iterator)  # take the first item from row_iterator 
for index, row in row_iterator: 
    if (row['term_x'] == prev['term_y']) & (row['term_y'] == prev['term_x']) & (row['Keyword'] == prev['Keyword']): 
        duplicate_df_selfmerge.drop(index, inplace=True) 
    prev = row 

How do you define 'duplicate'? What is your desired output for this example? – Allen


Also, your example has no Keyword column. – IanS

Answers


You could just put the two columns together, sort each pair, and then drop duplicates on those sorted pairs:

df['together'] = [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))] 

df.drop_duplicates(subset=['together']) 
Out[11]: 
    term_x Intersections  term_y   together 
0 boxers    1  briefs  boxers,briefs 
2 babies    6 costumes babies,costumes 
4 babies    12 clothes babies,clothes 
6 babies    1 clothings babies,clothings 

Edit: you said that time is a significant factor for this problem. Here are some timings comparing my solution and Allen's on a dataframe with 200,000 rows:

while df.shape[0] < 200000: 
    df = df.append(df) 

%timeit df.apply(lambda x: str(sorted([x.term_x,x.term_y])), axis=1) 
1 loop, best of 3: 6.62 s per loop 

%timeit [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))] 
10 loops, best of 3: 121 ms per loop 

As you can see, my approach is over 98% faster. pandas.DataFrame.apply is slow in many situations.
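For completeness, here is a similar vectorized variant (my own sketch, not part of either answer) that sorts the two term columns with NumPy instead of a Python-level list comprehension; the helper column names pair_lo and pair_hi are invented for illustration:

```python
import numpy as np
import pandas as pd

# A small sample of the question's data.
df = pd.DataFrame({
    "term_x": ["boxers", "briefs", "babies", "costumes"],
    "term_y": ["briefs", "boxers", "costumes", "babies"],
    "Intersections": [1, 1, 6, 6],
})

# Sort each (term_x, term_y) pair alphabetically, row by row.
pairs = np.sort(df[["term_x", "term_y"]].to_numpy(), axis=1)
df[["pair_lo", "pair_hi"]] = pairs

# Mirrored rows now share identical (pair_lo, pair_hi) values.
deduped = df.drop_duplicates(subset=["pair_lo", "pair_hi"])
```

This keeps the first occurrence of each unordered pair, just like the string-join approach above.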

df = pd.DataFrame({'Intersections': {0: 1, 1: 1, 2: 6, 3: 6, 4: 12, 5: 12, 6: 1, 7: 1}, 
'term_x': {0: 'boxers',1: 'briefs',2: 'babies',3: 'costumes',4: 'babies', 
    5: 'clothes',6: 'babies',7: 'clothings'}, 'term_y': {0: 'briefs',1: 'boxers', 
    2: 'costumes',3: 'babies',4: 'clothes',5: 'babies',6: 'clothings',7: 'babies'}}) 

#create a column combining term_x and term_y in sorted order 
df['team_xy'] = df.apply(lambda x: str(sorted([x.term_x,x.term_y])), axis=1) 
#drop duplicates on the combined fields. 
df.drop_duplicates(subset='team_xy',inplace=True) 

df 
Out[916]: 
    Intersections term_x  term_y     team_xy 
0    1 boxers  briefs  ['boxers', 'briefs'] 
2    6 babies costumes ['babies', 'costumes'] 
4    12 babies clothes ['babies', 'clothes'] 
6    1 babies clothings ['babies', 'clothings']
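Both answers leave the helper column in the result; if it is only needed for deduplication, it can be discarded afterwards. A small follow-up sketch (my addition, using a subset of the answer's data):

```python
import pandas as pd

# Minimal reproduction of the answer's setup.
df = pd.DataFrame({
    "term_x": ["boxers", "briefs", "babies", "costumes"],
    "term_y": ["briefs", "boxers", "costumes", "babies"],
    "Intersections": [1, 1, 6, 6],
})

# Build the sorted-pair key exactly as in the answer.
df["team_xy"] = df.apply(lambda x: str(sorted([x.term_x, x.term_y])), axis=1)

# Deduplicate, then drop the helper column once it has done its job.
deduped = df.drop_duplicates(subset="team_xy").drop(columns="team_xy")
```

After this, deduped has one row per unordered term pair and only the original three columns.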