2014-02-15

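Python: extract unique records from a CSV file

I have a CSV file from which I want to save only the unique records. The fourth field of each row contains some text followed by a human or mouse suffix, for example RHPN1_HUMAN and EPHA5_MOUSE.

For example, EPHA5 occurs in both human and mouse, so I want to remove its records. RHPN1 occurs only in human, so I want to keep that record.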

file1.csv

meNOG00001 9606 ENSP00000289013   RHPN1_HUMAN 

meNOG00005 10090 ENSMUSP00000060646 EPHA5_MOUSE 

meNOG00005 9606 ENSP00000273854   EPHA5_HUMAN 

meNOG00006 10090 ENSMUSP00000082503 RGPA1_MOUSE 

meNOG00006 9606 ENSP00000202677   RGPA2_HUMAN 

meNOG00006 9606 ENSP00000302647   RGPA1_HUMAN 

meNOG00010 9606 ENSP00000253669   HAUS8_HUMAN 

meNOG00011 10090 ENSMUSP00000017629 TOP2B_MOUSE 

meNOG00011 10090 ENSMUSP00000068896 TOP2A_MOUSE 

meNOG00011 9606 ENSP00000396704   TOP2B_HUMAN 

meNOG00011 9606 ENSP00000411532   TOP2A_HUMAN 

output.csv

meNOG00001 9606 ENSP00000289013   RHPN1_HUMAN 

meNOG00006 9606 ENSP00000202677   RGPA2_HUMAN 

meNOG00010 9606 ENSP00000253669   HAUS8_HUMAN 

I tried the following, but my code doesn't work the way I want:

file1 = open("file1.csv", "rU") 
reader1 = csv.reader(file1,delimiter=',') 

d =[] 
c =[] 
for row in reader1: 
    d.append(row[3].split('_')[0]) 
d=list(set(d)) 

for row1 in d: 
    for row2 in reader1: 
     if row1 == row2[3].split('_')[0]: 
       c.append(row2) 

    file1.seek(0) 

with open('output.csv', 'w') as f_out: 
    writer = csv.writer(f_out, delimiter=',') 
    for k in c: 
     writer.writerow(k) 

Answers

1
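import csv 
import collections 

data = collections.OrderedDict()   # 2 
# Python 3: the csv module expects files opened with newline='' 
# (the old "rU" mode is rejected by recent Python versions). 
with open("file1.csv", newline='') as f: 
    reader = csv.reader(f, delimiter=',') 
    for row in reader: 
        key = row[3].split('_')[0]   # e.g. "EPHA5" from "EPHA5_MOUSE" 
        if key in data: 
            del data[key]     # 1 
        else: 
            data[key] = row 

with open('output.csv', 'w', newline='') as f_out: 
    writer = csv.writer(f_out, delimiter=',') 
    writer.writerows(data.values()) 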
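  1. If a key is seen more than once, the item is deleted from the dict. As long as a key occurs at most twice, this removes the duplicates.
  2. An OrderedDict is used so that the rows keep their original order. If that doesn't matter to you, you can use a regular dict.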
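
To see why the "at most twice" caveat matters, here is a small in-memory sketch of the same delete-on-repeat logic. The function name `toggle_filter` and the sample rows are made up for illustration; they are not taken from the question's file.

```python
from collections import OrderedDict


def toggle_filter(rows):
    """Apply the delete-on-repeat logic from the snippet above to a list of rows."""
    data = OrderedDict()
    for row in rows:
        key = row[3].split('_')[0]
        if key in data:
            del data[key]      # key currently stored: drop it
        else:
            data[key] = row    # key not stored (1st, 3rd, 5th... sighting): store it
    return list(data.values())


# Hypothetical rows: GENE1 occurs three times, GENE2 once.
rows = [
    ["g1", "9606", "p1", "GENE1_HUMAN"],
    ["g1", "10090", "p2", "GENE1_MOUSE"],
    ["g1", "10116", "p3", "GENE1_RAT"],
    ["g2", "9606", "p4", "GENE2_HUMAN"],
]

result = toggle_filter(rows)
print(result)
```

The third GENE1 row is stored again after the second one deleted the entry, so it ends up in the result even though GENE1 is not unique.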

If a key can occur more than twice, you need a different way to keep track of which keys have been seen. You could use a set. For example,

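import csv 
import collections 

seen = set()                        # every key encountered so far 
data = collections.OrderedDict()    # rows whose key has been seen only once 
# Python 3: the csv module expects files opened with newline='' 
with open("file1.csv", newline='') as f: 
    reader = csv.reader(f, delimiter=',') 
    for row in reader: 
        key = row[3].split('_')[0] 
        if key in seen: 
            # pop() with a default avoids a KeyError on the third and 
            # later occurrences, when the row has already been removed. 
            data.pop(key, None) 
        else: 
            data[key] = row 
            seen.add(key) 

with open('output.csv', 'w', newline='') as f_out: 
    writer = csv.writer(f_out, delimiter=',') 
    writer.writerows(data.values()) 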
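
Another way to handle any number of repeats, sketched here as an alternative rather than as part of the code above, is to count the keys first with `collections.Counter` and then keep only the rows whose key occurs exactly once. The helper name `unique_rows` is hypothetical, and the sample rows are copied from the question on the assumption that the fields are comma-separated.

```python
import csv
import io
from collections import Counter


def gene_key(row):
    """Text before the first '_' in the fourth field, e.g. 'EPHA5'."""
    return row[3].split('_')[0]


def unique_rows(rows):
    """Return the rows whose gene key occurs exactly once, in input order."""
    rows = list(rows)                      # materialise so we can iterate twice
    counts = Counter(gene_key(row) for row in rows)
    return [row for row in rows if counts[gene_key(row)] == 1]


# A few rows from the question, assumed to be comma-separated.
sample = """\
meNOG00001,9606,ENSP00000289013,RHPN1_HUMAN
meNOG00005,10090,ENSMUSP00000060646,EPHA5_MOUSE
meNOG00005,9606,ENSP00000273854,EPHA5_HUMAN
meNOG00010,9606,ENSP00000253669,HAUS8_HUMAN
"""

kept = unique_rows(csv.reader(io.StringIO(sample)))
for row in kept:
    print(",".join(row))
```

This reads the whole input into memory, which is fine for small files but may not suit very large ones.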

This gives:
meNOG00001 ENSP0000028901 \t RHPN1_HUMAN
meNOG00005 ENSP00000273854 \t EPHA5_HUMAN
meNOG00006 ENSP00000302647 \t RGPA1_HUMAN
meNOG00006 ENSP00000202677 \t RGPA2_HUMAN
meNOG00010 ENSP00000253669 \t HAUS8_HUMAN
meNOG00011 ENSP00000396704 \t TOP2B_HUMAN
meNOG00011 ENSP00000411532 \t TOP2A_HUMAN
I need:
meNOG00001 9606 ENSP00000289013 RHPN1_HUMAN
meNOG00006 9606 ENSP00000202677 RGPA2_HUMAN
meNOG00010 9606 ENSP00000253669 HAUS8_HUMAN
user587739

0

Not fully tested, but you could use something like this:

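from collections import OrderedDict 

class OD(OrderedDict): 
    def __init__(self, *args, **kwargs): 
        # Keys seen so far, kept per instance rather than shared by the class. 
        self.coll = set() 
        OrderedDict.__init__(self, *args, **kwargs) 

    def __setitem__(self, key, value): 
        if key in self.coll: 
            # Seen before: drop the stored row if it is still there. 
            try: 
                del self[key] 
            except KeyError: 
                pass 
        else: 
            OrderedDict.__setitem__(self, key, value) 
            self.coll.add(key) 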

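I wrote it this way because I don't know whether you will have more than two matches. For example, with an odd number of matches, simply checking whether the key is already in the dictionary won't work, because any key that occurs an odd number of times would be treated as unique. The above handles that case. (It may be overkill, though.)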

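import csv 

d = OD() 

# Python 3: the csv module expects files opened with newline='' 
with open("file1.csv", newline='') as f_in: 
    reader = csv.reader(f_in, delimiter=',') 
    for row in reader: 
        key = row[3].split('_')[0] 
        d[key] = row     # OD drops the key if it has been seen before 

with open('output.csv', 'w', newline='') as f_out: 
    writer = csv.writer(f_out, delimiter=',') 
    for val in d.values(): 
        writer.writerow(val)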
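
As a quick check of the idea without touching any files, the following sketch feeds a few in-memory rows through a compact version of the same subclass. The rows and the gene names `ABC1` and `XYZ2` are invented for illustration, and the class body is a condensed rewrite rather than the exact code above.

```python
from collections import OrderedDict


class OD(OrderedDict):
    """Drop a key permanently once it has been assigned more than once."""

    def __init__(self):
        super().__init__()
        self.coll = set()          # every key assigned so far

    def __setitem__(self, key, value):
        if key in self.coll:
            if key in self:        # repeat: remove the stored row if present
                del self[key]
        else:
            super().__setitem__(key, value)
            self.coll.add(key)


# Hypothetical rows: ABC1 appears three times, XYZ2 once.
rows = [
    ["n1", "9606", "p1", "ABC1_HUMAN"],
    ["n1", "10090", "p2", "ABC1_MOUSE"],
    ["n1", "10116", "p3", "ABC1_RAT"],
    ["n2", "9606", "p4", "XYZ2_HUMAN"],
]

d = OD()
for row in rows:
    d[row[3].split('_')[0]] = row

kept = list(d.values())
print(kept)
```

Only the XYZ2 row remains: the third ABC1 assignment is ignored because the key is already in the seen set.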