Pandas: remove reversed duplicates from a data frame

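I have a data frame with two columns, A and B. The order of A and B does not matter here; for example, I would consider (0, 50) and (50, 0) to be duplicates. In pandas, what is an efficient way to remove these duplicates from a data frame?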

import pandas as pd

# Initial data frame.
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})
data
    A   B
0   0  50
1  10  22
2  11  35
3  21   5
4  22  10
5  35  11
6   5  21
7  50   0

# Desired output with "duplicates" removed.
data2 = pd.DataFrame({'A': [0, 5, 10, 11],
                      'B': [50, 21, 22, 35]})
data2
    A   B
0   0  50
1   5  21
2  10  22
3  11  35

Ideally, the output would be sorted by the values in column A.

Source

2016-11-07 Adam

You can sort each row of the data frame before dropping the duplicates:

data.apply(lambda r: sorted(r), axis=1).drop_duplicates()

#     A   B
# 0   0  50
# 1  10  22
# 2  11  35
# 3   5  21

If you would like the result sorted by column A:

data.apply(lambda r: sorted(r), axis=1).drop_duplicates().sort_values('A')

#     A   B
# 0   0  50
# 3   5  21
# 1  10  22
# 2  11  35
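A note for newer pandas versions: as far as I can tell, since pandas 0.23 a row-wise apply whose function returns a list gives back a Series of lists rather than a data frame, so chaining drop_duplicates() directly onto it may fail there. The sketch below is one way to keep the same idea working, assuming the result_type='broadcast' option of DataFrame.apply is available in your version. The names sorted_rows and result are mine, not from the answer.

```python
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# Assumption: result_type='broadcast' (pandas >= 0.23) keeps the original
# index and column labels, so each sorted row is written back with the
# smaller value under A and the larger value under B.
sorted_rows = data.apply(sorted, axis=1, result_type='broadcast')

# Drop the now-identical rows, order by A, and renumber the index.
result = sorted_rows.drop_duplicates().sort_values('A').reset_index(drop=True)
print(result)
```

The reset_index(drop=True) at the end is optional; it only renumbers the rows so the index matches the desired output in the question.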

Source

2016-11-07 21:22:43 Psidom

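No need for the lambda, `.apply(sorted, axis=1)` will work. – root

@root. That's right. A better option. – Psidom

I like this answer! Everything I thought of involved stacking the data frame. This cleverly eliminates that need. – piRSquared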

Here is a somewhat uglier, but faster, solution (with numpy imported as np):

In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
Out[44]:
    A   B
0   0  50
1  10  22
2  11  35
3   5  21
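For anyone who wants to run this outside an IPython session, here is a self-contained sketch of the same np.sort approach, extended with the sort by column A that the question asks for. The variable names sorted_values and result are only illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# Sort the two values inside every row at once, so that (50, 0) and
# (0, 50) both become (0, 50). This works on the underlying NumPy array.
sorted_values = np.sort(data.values, axis=1)

# Rebuild a data frame with the original column names, drop the
# identical rows, then order by A and renumber the index.
result = (pd.DataFrame(sorted_values, columns=data.columns)
            .drop_duplicates()
            .sort_values('A')
            .reset_index(drop=True))
print(result)
```

Note that the rows in the result are the sorted pairs, not the original rows, so a pair that only appeared as (21, 5) would come out as (5, 21).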

Timing, for an 8K-row DF:

In [50]: big = pd.concat([data] * 10**3, ignore_index=True) 

In [51]: big.shape 
Out[51]: (8000, 2) 

In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates() 
1 loop, best of 3: 3.04 s per loop 

In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates() 
100 loops, best of 3: 3.96 ms per loop 

In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates() 
1 loop, best of 3: 2.69 s per loop
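If you want to repeat the comparison without the %timeit magic, a rough sketch using the standard timeit module could look like this. The helper names rowwise and vectorized are made up for the example, and result_type='broadcast' is my assumption for getting the row-wise version to return a data frame on newer pandas. The numbers you get will depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})
big = pd.concat([data] * 10**3, ignore_index=True)  # 8000 rows


def rowwise(df):
    # Sorts each row with a Python-level call per row.
    # Assumption: result_type='broadcast' is needed on pandas >= 0.23.
    return df.apply(sorted, axis=1, result_type='broadcast').drop_duplicates()


def vectorized(df):
    # Sorts all rows in one NumPy call on the underlying array.
    return pd.DataFrame(np.sort(df.values, axis=1),
                        columns=df.columns).drop_duplicates()


# One run for the slow variant, the average of ten runs for the fast one.
t_row = timeit.timeit(lambda: rowwise(big), number=1)
t_vec = timeit.timeit(lambda: vectorized(big), number=10) / 10
print(f"row-wise apply: {t_row:.4f} s per call")
print(f"np.sort:        {t_vec:.4f} s per call")
```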

Source

2016-11-07 21:30:42 MaxU

This is a vectorized implementation of the same answer. Not ugly at all! :-) – piRSquared
