Same observation in multiple columns

I have a DataFrame that looks like this:

ID1  ID2  variables
a    b    something
b    g    something
c    h    something
d    i    something
a    h    something

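If an ID appears in both ID1 and ID2, I want to drop from the dataset the observations that have that value in ID1. So in this case the result would be: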

ID1  ID2  variables
a    b    something
c    h    something
d    i    something
a    h    something

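Note:

There can be multiple observations with the same ID, so renaming, concatenating, and dropping duplicates will not work.

The dataset is fairly large (millions of observations), so looping over each value is not an option.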

Source

2016-04-18 Peter

Use isin to check which ID1 values occur in ID2, then filter the data by slicing with .loc.

In [76]: df.loc[~df['ID1'].isin(df['ID2']), :]
Out[76]:
  ID1 ID2  variables
0   a   b  something
2   c   h  something
3   d   i  something
4   a   h  something

Details:

In [77]: df
Out[77]:
  ID1 ID2  variables
0   a   b  something
1   b   g  something
2   c   h  something
3   d   i  something
4   a   h  something

In [78]: ~df['ID1'].isin(df['ID2'])
Out[78]:
0     True
1    False
2     True
3     True
4     True
Name: ID1, dtype: bool

In [79]: df.loc[~df['ID1'].isin(df['ID2']), :]
Out[79]:
  ID1 ID2  variables
0   a   b  something
2   c   h  something
3   d   i  something
4   a   h  something
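The same steps as a standalone script, as a minimal sketch: the frame is rebuilt from the question's table, and the names in_id2 and result are only illustrative.

```python
import pandas as pd

# Sample data from the question; "something" is a placeholder value.
df = pd.DataFrame({
    "ID1": ["a", "b", "c", "d", "a"],
    "ID2": ["b", "g", "h", "i", "h"],
    "variables": ["something"] * 5,
})

# True where the row's ID1 value occurs anywhere in the ID2 column.
in_id2 = df["ID1"].isin(df["ID2"])

# Keep only the rows whose ID1 never shows up in ID2.
result = df.loc[~in_id2, :]
print(result)
```

isin is vectorized, so no Python-level loop over the values is needed, and rows 0 and 4 both survive even though they share the ID1 value a.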

Source

2016-04-18 13:45:43 Zero

I think you can use isin with boolean indexing, inverting the boolean Series with ~:

print(df.ID1.isin(df.ID2))
0    False
1     True
2    False
3    False
4    False
Name: ID1, dtype: bool

print(~df.ID1.isin(df.ID2))
0     True
1    False
2     True
3     True
4     True
Name: ID1, dtype: bool

print(df[~df.ID1.isin(df.ID2)])
  ID1 ID2  variables
0   a   b  something
2   c   h  something
3   d   i  something
4   a   h  something

Timings:

df = pd.concat([df]*100000).reset_index(drop=True) 

In [157]: %timeit df.loc[~df['ID1'].isin(df['ID2']), :] 
10 loops, best of 3: 55.5 ms per loop 

In [158]: %timeit df[~df.ID1.isin(df.ID2)] 
10 loops, best of 3: 55 ms per loop
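Outside IPython, a rough equivalent of this comparison can be sketched with the standard timeit module. The helper names with_loc and with_brackets and the run count are assumptions for illustration, and the timings will vary with the machine and the pandas version.

```python
import timeit

import pandas as pd

# Sample data from the question; "something" is a placeholder value.
small = pd.DataFrame({
    "ID1": ["a", "b", "c", "d", "a"],
    "ID2": ["b", "g", "h", "i", "h"],
    "variables": ["something"] * 5,
})

# Repeat the 5 sample rows 100,000 times, as in the test above.
big = pd.concat([small] * 100000).reset_index(drop=True)


def with_loc():
    """Filter using .loc with the inverted isin mask."""
    return big.loc[~big["ID1"].isin(big["ID2"]), :]


def with_brackets():
    """Filter using plain boolean indexing with the same mask."""
    return big[~big.ID1.isin(big.ID2)]


# Average seconds per call over a handful of runs.
runs = 5
t_loc = timeit.timeit(with_loc, number=runs) / runs
t_brackets = timeit.timeit(with_brackets, number=runs) / runs
print(f"loc: {t_loc * 1000:.1f} ms, brackets: {t_brackets * 1000:.1f} ms")
```

Both functions apply the same mask, so any difference between them should be small, which matches the near-identical figures reported above.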

Source

2016-04-18 13:45:25 jezrael

Perhaps the simplest way:

df.query('ID1 not in ID2')
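As a quick check, here is a small sketch that runs the query form on the question's sample data and compares it with the isin approach. The variable names by_query and by_isin are illustrative only.

```python
import pandas as pd

# Sample data from the question; "something" is a placeholder value.
df = pd.DataFrame({
    "ID1": ["a", "b", "c", "d", "a"],
    "ID2": ["b", "g", "h", "i", "h"],
    "variables": ["something"] * 5,
})

# query() understands "not in" between two column names.
by_query = df.query("ID1 not in ID2")

# The explicit mask-based form for comparison.
by_isin = df[~df["ID1"].isin(df["ID2"])]

print(by_query)
print(by_query.equals(by_isin))
```

The query string is shorter to read, while the explicit mask may be easier to reuse when the filter has to be combined with other conditions.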

Source

2016-04-18 14:08:19 PhilChang
