2014-07-10 22 views
2

Pandas: drop duplicate rows when the two columns are reversed

I have a dataset with 2 columns, like the one below:

InteractorA InteractorB 
AGAP028204 AGAP005846 
AGAP028204 AGAP003428 
AGAP028200 AGAP011124 
AGAP028200 AGAP004335 
AGAP028200 AGAP011356 
AGAP028194 AGAP008414 

I am using pandas and I want to remove rows that are present twice but with the two columns reversed, like below. From this...

InteractorA InteractorB 
AGAP002741 AGAP008026 
AGAP008026 AGAP002741 

to this...

InteractorA InteractorB 
AGAP002741 AGAP008026 

because, for all intents and purposes, they are the same.

Is there a built-in way to handle this?

Answers

3

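I ended up writing a hacky script that iterates over the rows, builds the two concatenated strings for each row, checks whether the pair has already appeared in either order, and collects the indexes of the rows to drop.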

import pandas as pd 

checklist = [] 
indexes_to_drop = [] 

interactions = pd.read_csv('original_interactions.txt', delimiter = '\t') 

for index, row in interactions.iterrows(): 
    check_string = row['InteractorA'] + row['InteractorB'] 
    check_string_rev = row['InteractorB'] + row['InteractorA'] 
    # Test each string separately: `(a or b) in checklist` only ever tests `a`. 
    if check_string in checklist or check_string_rev in checklist: 
        indexes_to_drop.append(index) 
    checklist.append(check_string) 
    checklist.append(check_string_rev) 

# iterrows() yields index labels, so drop by label directly. 
no_dups = interactions.drop(indexes_to_drop) 

print(no_dups.shape) 

no_dups.to_csv('no_duplicates.txt',sep='\t',index = False) 

2017 edit: a few years on and a bit more experienced, here is a more elegant solution for anyone looking for something similar:

In [8]: df 
Out[8]: 
    InteractorA InteractorB 
0 AGAP028204 AGAP005846 
1 AGAP028204 AGAP003428 
2 AGAP028200 AGAP011124 
3 AGAP028200 AGAP004335 
4 AGAP028200 AGAP011356 
5 AGAP028194 AGAP008414 
6 AGAP002741 AGAP008026 
7 AGAP008026 AGAP002741 

In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1) 

In [19]: df 
Out[19]: 
    InteractorA InteractorB   check_string 
0 AGAP028204 AGAP005846 AGAP005846AGAP028204 
1 AGAP028204 AGAP003428 AGAP003428AGAP028204 
2 AGAP028200 AGAP011124 AGAP011124AGAP028200 
3 AGAP028200 AGAP004335 AGAP004335AGAP028200 
4 AGAP028200 AGAP011356 AGAP011356AGAP028200 
5 AGAP028194 AGAP008414 AGAP008414AGAP028194 
6 AGAP002741 AGAP008026 AGAP002741AGAP008026 
7 AGAP008026 AGAP002741 AGAP002741AGAP008026 

In [20]: df.drop_duplicates('check_string') 
Out[20]: 
    InteractorA InteractorB   check_string 
0 AGAP028204 AGAP005846 AGAP005846AGAP028204 
1 AGAP028204 AGAP003428 AGAP003428AGAP028204 
2 AGAP028200 AGAP011124 AGAP011124AGAP028200 
3 AGAP028200 AGAP004335 AGAP004335AGAP028200 
4 AGAP028200 AGAP011356 AGAP011356AGAP028200 
5 AGAP028194 AGAP008414 AGAP008414AGAP028194 
6 AGAP002741 AGAP008026 AGAP002741AGAP008026 
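
The same idea can be sketched without a row-wise `apply` and without a helper column, which may matter on larger frames. This is only an illustration under assumptions: the small frame below is made up from IDs in the question, and the names `pairs`, `is_repeat` and `deduped` are not from the original answer.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; rows 2 and 3 are the same pair in reverse order.
df = pd.DataFrame({
    "InteractorA": ["AGAP028204", "AGAP028194", "AGAP002741", "AGAP008026"],
    "InteractorB": ["AGAP005846", "AGAP008414", "AGAP008026", "AGAP002741"],
})

# Sort the two IDs within each row so (a, b) and (b, a) become identical.
pairs = np.sort(df[["InteractorA", "InteractorB"]].to_numpy(), axis=1)

# Flag every row whose sorted pair was already seen in an earlier row.
is_repeat = pd.DataFrame(pairs, index=df.index).duplicated()

# Keep only first occurrences; the original column order is untouched.
deduped = df[~is_repeat]
print(deduped)
```

Because the sorted values are only used to build the mask, the rows that are kept retain their original InteractorA/InteractorB orientation.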
0

I think the following will work:

In [37]: 
import pandas as pd 
import io 
temp = """InteractorA InteractorB 
AGAP028204 AGAP005846 
AGAP028204 AGAP003428 
AGAP028200 AGAP011124 
AGAP028200 AGAP004335 
AGAP028200 AGAP011356 
AGAP028194 AGAP008414 
AGAP002741 AGAP008026 
AGAP008026 AGAP002741""" 
df = pd.read_csv(io.StringIO(temp), sep=r'\s+') 
df 
Out[37]: 
    InteractorA InteractorB 
0 AGAP028204 AGAP005846 
1 AGAP028204 AGAP003428 
2 AGAP028200 AGAP011124 
3 AGAP028200 AGAP004335 
4 AGAP028200 AGAP011356 
5 AGAP028194 AGAP008414 
6 AGAP002741 AGAP008026 
7 AGAP008026 AGAP002741 

So, I downloaded your data and realised I had misunderstood what you wanted; the following should now work:

# first get the values that are unique 
In [72]: 
df1 = df[~df.InteractorA.isin(df.InteractorB)] 
df1.shape 
Out[72]: 
(2386, 2) 

Now we want to get the duplicated rows, but take only the first value:

In [74]: 

df2 = df[df.InteractorA.isin(df.InteractorB)] 
df2 = df2.groupby('InteractorA').first().reset_index() 
df2.shape 
Out[74]: 
(3074, 2) 

Now concatenate the 2 dataframes:

In [75]: 

merged = pd.concat([df1, df2], ignore_index=True) 
merged.shape 
Out[75]: 
(5460, 2) 

I think this is now correct.

+0

This seems to get rid of some of them but not all; for example, I still have 'AGAP007031 \t AGAP010265' and 'AGAP010265 \t AGAP007031' appearing in my dataset. – BML91

+0

It still works for me; could you post more data so I can see where this is failing? – EdChum

+0

OK, the dataset is located here - https://dl.dropboxusercontent.com/u/6037105/interactions_unique.txt – BML91
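
One way to check a result for leftover reversed pairs, such as the one reported in the comments above, is a small helper like the sketch below. The function name `find_reversed_pairs` and the three-row `sample` frame are hypothetical and only for illustration; the column names are the ones used in the question.

```python
import pandas as pd

def find_reversed_pairs(df, col_a="InteractorA", col_b="InteractorB"):
    """Return the rows whose (a, b) pair also occurs somewhere as (b, a)."""
    # Every (a, b) pair present in the frame, for O(1) membership tests.
    seen = set(zip(df[col_a], df[col_b]))
    # A row is flagged if its mirror image exists; self-pairs (a, a) are ignored.
    mask = [(b, a) in seen and a != b for a, b in zip(df[col_a], df[col_b])]
    return df[mask]

# Illustrative data: rows 0 and 1 are the same interaction in reverse order.
sample = pd.DataFrame({
    "InteractorA": ["AGAP007031", "AGAP010265", "AGAP028204"],
    "InteractorB": ["AGAP010265", "AGAP007031", "AGAP005846"],
})
leftover = find_reversed_pairs(sample)
print(len(leftover))
```

An empty result means no pair appears in both orientations, so it can be run on the output of any of the approaches in this thread.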

0

This is the cleanest solution I have managed to get working for my own purposes.

Create a column that holds each row's two values combined into a sorted list

df['sorted_row'] = [sorted([a,b]) for a,b in zip(df.InteractorA, df.InteractorB)] 

Duplicates cannot be dropped on lists (they are unhashable), so convert the column to strings

df['sorted_row'] = df['sorted_row'].astype(str) 

Drop the duplicates

df.drop_duplicates(subset=['sorted_row'], inplace=True)
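
A possible variation on the steps above, sketched here as an assumption rather than as part of the original answer: tuples are hashable, unlike lists, so sorted tuples can serve as keys without the string conversion, and keeping them in a separate Series avoids adding a helper column to `df`. The sample frame and the names `keys` and `result` are made up for illustration.

```python
import pandas as pd

# Illustrative frame; rows 1 and 2 are the same pair in reverse order.
df = pd.DataFrame({
    "InteractorA": ["AGAP028200", "AGAP002741", "AGAP008026"],
    "InteractorB": ["AGAP011124", "AGAP008026", "AGAP002741"],
})

# One sorted tuple per row; tuples are hashable, so no astype(str) is needed.
keys = pd.Series(
    [tuple(sorted(pair)) for pair in zip(df["InteractorA"], df["InteractorB"])],
    index=df.index,
)

# Keep the first row for each unordered pair, leaving df itself unmodified.
result = df[~keys.duplicated()]
print(result)
```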