2014-04-04 61 views
1

我有一個看起來像這樣一個巨大的輸入文件,在一個循環中刪除冗餘使用python

contig protein start end 
con1 P1 140 602 
con1 P2 140 602 
con1 P3 232 548 
con2 P4 335 801 
con2 P5 642 732 
con2 P6 335 779 
con2 P7 729 812 
con3 P8 17 348 
con3 P9 16 348 

我想刪除同源普的或多餘的普的,我以爲是那些具有相同的開始和結束網站和分別具有較小開始或結束網站的網站。所以,我的輸出文件將是這樣的, file.txt的

con1 P1 140 602 
con1 P3 232 548 
con2 P4 335 801 
con2 P7 729 812 

試圖腳本,由於某種原因,不符合這兩個條件,

from itertools import groupby 
def non_homolog(hits): 
    nonhomolog=[] 
    overst = False 
    for i in range(1,len(hits)): 
     (p, c) = hits[i-1], hits[i] 
     if p[2] <= c[2] and c[3] <= p[3]: 
      if not overst: nonhomolog.append(c) 
      nonhomolog.append(c) 
      overst = True 
    return nonhomolog 

fh = open('example.txt') 
oh = open('nonhomologs.txt', 'w') 
for qid, grp in groupby(fh, lambda l: l.split()[0]): 
    hits = [] 
    for line in grp: 
     hsp = line.split() 
     hsp[2], hsp[3] = int(hsp[2]), int(hsp[3]) 
     hits.append(hsp) 
    hits.sort(key=lambda x: x[2]) 
    if non_homolog(hits): 
     for hit in hits: 
      oh.write('\t'.join([str(f) for f in hit])+'\n') 
+1

可能的問題:如果我們有三個項目,說'con 1的P1 140 602','con 1的P2 144 602'。 con1 p3 148 602' => p1與p2,p2至p3同源,但不同於p1至p3;這應該如何處理? –

+0

我主要感興趣的是一旦有0的差異,基本相同的起始站點,那麼,只是爲了比較結果考慮也有+5單位差異,可能會少一些。所以我想,你提到的案例在後面的案例中會涉及同系物。 – user3224522

+0

@Hugh Bothwell,說實話,我有點熟練地解決了這些問題,所以現在可能會更好地專注於那些相同的問題,我必須考慮這種情況,因爲我的一些對手的命中率低於10-20 .. – user3224522

回答

1

試試這個關於大小:

# this code assumes Python 2.7 
from itertools import groupby, izip 
from operator import attrgetter 

INPUT = "file.txt" 
HOMO_YES = "homologs.txt" 
HOMO_NO = "nonhomologs.txt" 
MAX_DIFF = 5 

class Row: 
    __slots__ = ["line", "con", "protein", "start", "end"] 

    def __init__(self, s): 
     self.line = s.rstrip() 
     data   = s.split() 
     self.con  = data[0] 
     self.protein = data[1] 
     self.start = int(data[2]) 
     self.end  = int(data[3]) 

    def __str__(self): 
     return self.line 

def count_homologs(items, max_diff=MAX_DIFF): 
    num_items = len(items) 
    counts  = [0] * num_items 
    # first item 
    for i, item_i in enumerate(items): 
     max_start = item_i.start + max_diff 
     max_end = item_i.end + max_diff 
     # second item 
     for j in xrange(i+1, num_items): 
      item_j = items[j] 
      if item_j.start > max_start: 
       break 
      elif item_j.end <= max_end: 
       counts[i] += 1 
       counts[j] += 1 
    return counts 

def main(): 
    with open(INPUT) as inf, open(HOMO_YES, "w") as outhomo, open(HOMO_NO, "w") as outnothomo: 
     # skip header 
     next(inf, '') 
     rows = (Row(line) for line in inf) 

     for con, item_iter in groupby(rows, key=attrgetter("con")): 
      # per-con list of Rows sorted by start,end 
      items = sorted(item_iter, key=attrgetter("start", "end")) 
      # get #homologs for each item 
      counts = count_homologs(items) 
      # do output 
      for c,item in izip(counts, items): 
       if c: 
        outhomo.write(str(item) + "\n") 
       else: 
        outnothomo.write(str(item) + "\n") 

if __name__=="__main__": 
    main() 

在你給出的數據,產生:

=== homologs.txt ===

con1 P1 140 602 
con1 P2 140 602 
con3 P9 16 348 
con3 P8 17 348 

=== nonhomologs.txt ===

con1 P3 232 548 
con2 P6 335 779 
con2 P4 335 801 
con2 P5 642 732 
con2 P7 729 812 
+0

嗨,謝謝你的回答!我試圖運行它,它給了我一個無效的語法錯誤時,打開塊使用.. – user3224522

+0

你使用的是什麼版本的Python? –

+0

python 2.6,由於限制,我不能升級。 – user3224522