
Remove duplicates from a pandas DataFrame if the duplicate values are in the next row

I have a large DataFrame with several columns, in a format that looks like this:

term_x     Intersections  term_y
boxers     1              briefs
briefs     1              boxers
babies     6              costumes
costumes   6              babies
babies     12             clothes
clothes    12             babies
babies     1              clothings
clothings  1              babies

This file has well over a million rows. What I want to do is cut out these redundant rows. Is there any way to use the pandas drop-duplicates functionality to eliminate these copies in a fast, Pythonic way? My current approach iterates over the whole DataFrame, gets the values of the next row, and then drops the duplicated rows, but this has proved to be very slow:

row_iterator = duplicate_df_selfmerge.iterrows()
_, next = row_iterator.__next__()  # take first item from row_iterator
for index, row in row_iterator:
    if (row['term_x'] == next['term_y']) & (row['term_y'] == next['term_x']) & (row['Keyword'] == next['Keyword']):
        duplicate_df_selfmerge.drop(index, inplace=True)
    next = row
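One incremental fix, before any vectorization: collect the offending indices and drop them in a single call after the loop, since drop(..., inplace=True) inside the loop rewrites the frame on every hit. A minimal sketch under the same assumptions as the loop above (including the Keyword column, which the example data does not show; see the comments below):

to_drop = []
row_iterator = duplicate_df_selfmerge.iterrows()
_, prev = next(row_iterator)  # take the first row as the initial "previous"
for index, row in row_iterator:
    # same pairwise test as above, but only record the index
    if (row['term_x'] == prev['term_y']) and (row['term_y'] == prev['term_x']) \
            and (row['Keyword'] == prev['Keyword']):
        to_drop.append(index)
    prev = row
duplicate_df_selfmerge = duplicate_df_selfmerge.drop(to_drop)  # one drop at the end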

How do you define 'duplicate'? What is your desired output for your example? – Allen


Also, your example doesn't have a Keyword column. – IanS

Answers


You could just put the two columns together, sort each pair, and then drop duplicates on those sorted pairs:

df['together'] = [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))] 

df.drop_duplicates(subset=['together']) 
Out[11]:
      term_x  Intersections     term_y          together
0     boxers              1     briefs     boxers,briefs
2     babies              6   costumes   babies,costumes
4     babies             12    clothes    babies,clothes
6     babies              1  clothings  babies,clothings
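If the helper column is only needed for the de-duplication, it can be discarded right afterwards (a small usage sketch; deduped is just an illustrative name):

deduped = df.drop_duplicates(subset=['together']).drop('together', axis=1)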

Edit: you said that time is a major factor for this problem. Here are some timings comparing my solution and Allen's on a DataFrame with 200,000 rows:

while df.shape[0] < 200000:
    df = df.append(df)  # append returns a new frame; reassign or this loops forever

%timeit df.apply(lambda x: str(sorted([x.term_x,x.term_y])), axis=1) 
1 loop, best of 3: 6.62 s per loop 

%timeit [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))] 
10 loops, best of 3: 121 ms per loop 

As you can see, my approach is over 98% faster. In many cases, pandas.DataFrame.apply is slow.
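For comparison, a fully vectorized sketch (not from either answer) that avoids per-row Python string joins altogether: sort the two term columns row-wise with NumPy and flag repeated pairs with duplicated(). It assumes the same column names as the example above; whether it beats the join-based list comprehension depends on the data, but both avoid apply:

import numpy as np
import pandas as pd

# Sort each (term_x, term_y) pair row-wise so swapped duplicates line up.
pairs = np.sort(df[['term_x', 'term_y']].values, axis=1)
# duplicated() marks every occurrence of an unordered pair after the first.
mask = pd.DataFrame(pairs).duplicated().values
deduped = df.loc[~mask]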

import pandas as pd

df = pd.DataFrame({'Intersections': {0: 1, 1: 1, 2: 6, 3: 6, 4: 12, 5: 12, 6: 1, 7: 1},
                   'term_x': {0: 'boxers', 1: 'briefs', 2: 'babies', 3: 'costumes', 4: 'babies',
                              5: 'clothes', 6: 'babies', 7: 'clothings'},
                   'term_y': {0: 'briefs', 1: 'boxers', 2: 'costumes', 3: 'babies',
                              4: 'clothes', 5: 'babies', 6: 'clothings', 7: 'babies'}})

# create a column that combines term_x and term_y in sorted order
df['team_xy'] = df.apply(lambda x: str(sorted([x.term_x, x.term_y])), axis=1)
# drop duplicates on the combined field
df.drop_duplicates(subset='team_xy', inplace=True)

df 
Out[916]:
   Intersections  term_x     term_y                  team_xy
0              1  boxers     briefs     ['boxers', 'briefs']
2              6  babies   costumes   ['babies', 'costumes']
4             12  babies    clothes    ['babies', 'clothes']
6              1  babies  clothings  ['babies', 'clothings']