我認爲你可以使用groupby
與filter
- 條件是 - 不是2
行有重複值,並在組列message
isin
沒有重視T
或X
:
import pandas as pd
df = pd.DataFrame({"ID":["AA-1", "AA-1", "C-0" ,"BB-2", "BB-2"],
"symbol":["A","A","C","B","B"],
"date":["06/24/2014","06/24/2014","06/20/2013","06/25/2015","06/25/2015"],
"message": ["T","X","T","",""] })
print (df)
ID date message symbol
0 AA-1 06/24/2014 T A
1 AA-1 06/24/2014 X A
2 C-0 06/20/2013 T C
3 BB-2 06/25/2015 B
4 BB-2 06/25/2015 B
df1 = df.groupby(['ID','date','symbol']).filter(lambda x: ~((len(x) == 2) &
(x.message.isin(['T','X']).all())))
print (df1)
ID date message symbol
2 C-0 06/20/2013 T C
3 BB-2 06/25/2015 B
4 BB-2 06/25/2015 B
Filtration in docs。
EDIT通過comment:
import pandas as pd
df = pd.DataFrame({"ID":["AA-1", "AA-1", "C-0", "C-0","BB-2", "BB-2"],
"symbol":["A","A","C","C", "B","B"],
"date":["06/24/2014","06/24/2014","06/20/2013","06/20/2013","06/25/2015","06/25/2015"],
"message": ["T","X","X","X","",""] })
print (df)
ID date message symbol
0 AA-1 06/24/2014 T A
1 AA-1 06/24/2014 X A
2 C-0 06/20/2013 X C
3 C-0 06/20/2013 X C
4 BB-2 06/25/2015 B
5 BB-2 06/25/2015 B
如果需要,每組X
或T
刪除值 - 這意味着它除去雙X
或雙T
太和每個組的每個len
總是2
:
df1 = df.groupby(['ID','date','symbol']).filter(lambda x: ~x.message.isin(['T','X']).all())
print (df1)
ID date message symbol
4 BB-2 06/25/2015 B
5 BB-2 06/25/2015 B
如果需要刪除只有值爲T
和X
的組,可以首先通過檢查每個組中的第一個值是T
和第二個X
,通過message
然後filter
。 ( 'T' 是第一和X
是第二,因爲排序):
df2 = df.sort_values('message')
.groupby(['ID','date','symbol'], sort=False)
.filter(lambda x: ((x.message.iloc[0] != 'T') | (x.message.iloc[1] != 'X')))
print (df2)
ID date message symbol
4 BB-2 06/25/2015 B
5 BB-2 06/25/2015 B
2 C-0 06/20/2013 X C
3 C-0 06/20/2013 X C
只是爲了澄清 - 你是否想保留重複行以防萬一有2個以上? – Stefan
我應該在我的問題中更清楚地說明。我的數據是成對的。對於每個「X」行,除了「消息」列之外,(或至少應該是)恰好一個「T」行與其他列相等。在這種情況下,至少在數據收集正確的情況下,應該只有一對匹配的觀測值。 – dleal