2014-04-02 134 views
1

我有一個BLAST表格輸出,有數百萬個匹配.Con是我的序列,P是蛋白質命中。我有興趣區分與以下3種情況對應的點擊。它們應該全部打印在3個獨立的新文件中,並且在文件1中的重疊羣不應該在文件2,3等中。如何做到這一點?查找重疊區域和非重疊區域

con1 ----------------------- (Contigs with both overlapping and non overlapping hits) 
    p1---- p2 ------ p4--- 
       p3----- 
con2 --------------------- (only overlapping) con3 ----------------(only non overlp) 
    p1 -----          p1 ---- p2 ----- 
     p2 ------- 
      p3 ----- 

如果我知道蛋白質開始和結束位點,就有可能確定重疊或非重疊;如果S1 < E2 < S2和E1 < S2 < E2或S2 - E1> 0 我輸入文件,即

contig protein start end 
con1 P1 481 931 
con1 P2 140 602 
con1 P3 232 548 
con2 P4 335 406 
con2 P5 642 732 
con2 P6 2282 2433 
con2 P7 729 812 
con3 P8 17 148 
con3 P9 289 45 

我的腳本(這我打印僅命中不重疊)

from itertools import groupby 

def nonoverlapping(hits): 
    """Returns a list of non-overlapping hits.""" 
    nonover = [] 
    overst = False 
    for i in range(1,len(hits)): 
     (p, c) = hits[i-1], hits[i] 
     if c[2] > p[3]: 
      if not overst: nonover.append(p) 
      nonover.append(c) 
      overst = True 
    return nonover 

fh = open('file.txt') 
oh = open('result.txt', 'w') 
for qid, grp in groupby(fh, lambda l: l.split()[0]): 
    hits = [] 
    for line in grp: 
     hsp = line.split() 
     hsp[2], hsp[3] = int(hsp[2]), int(hsp[3]) 
     hits.append(hsp) 
    if len(hits) > 1: 
     hits.sort(key=lambda x: x[2]) 
     for hit in nonoverlapping(hits): 
      oh.write('\t'.join([str(f) for f in hit])+'\n') 

回答

2

我會做這樣的事情。爲兩個命中定義一個「重疊」函數,然後測試每個重疊羣是全部,部分還是不重疊。然後寫所有的重疊羣到所需的文件:

from itertools import groupby 

def overlaps(a, b): 
    result = True 
    # Supposing a[2] is the start, a[3] the end. 
    # If end before start, they are not overlapping 
    if a[3] < b[2] or b[3] < a[2]: 
     result = False 

    return result 

def test_overlapping(hits): 
    overlapping = 'None' 
    overlapping_count = 0 
    for i in range(len(hits)-1): 
     if overlaps(hits[i], hits[i+1]): 
      overlapping_count += 1 


    if overlapping_count == 0: 
     overlapping = 'None' 
    elif overlapping_count == len(hits) -1: 
     overlapping = 'All' 
    else: 
     overlapping = 'Some' 

    return overlapping 


fh = open('file.txt') 

file_all = open('result_all.txt', 'w') 
file_some = open('result_some.txt', 'w') 
file_none = open('result_none.txt', 'w') 

line = fh.readline() # quit header 

for qid, grp in groupby(fh, lambda l: l.split()[0]): 
    hits = [] 
    for line in grp: 
     hsp = line.split() 
     hsp[2], hsp[3] = int(hsp[2]), int(hsp[3]) 
     hits.append(hsp) 
    if len(hits) > 1: 
     hits.sort(key=lambda x: x[2]) 
     overlapping = test_overlapping(hits) 
     out_file = file_none 
     if overlapping == 'All': 
      out_file = file_all 
     elif overlapping == 'Some': 
      out_file = file_some 

     for h in hits: 
      out_file.write('\t'.join([str(v) for v in h])) 
      out_file.write('\n') 


file_all.close() 
file_some.close() 
file_none.close() 
+0

謝謝你,我把你的算法的路線,但不反芻運行script..what如果我只補充,如果C [2] <= P [3]或c [2]> p [3]:行? – user3224522

+0

我測試了整個事情,現在就把它。這個對我有用。問題是,對於每個重疊羣,你需要看它是否有重疊的讀取,看看你想寫在哪個文件中,對嗎?這就是我所理解的。讓我知道如果這可以解決問題:) – cnluzon

+1

你知道,它似乎工作,我只需要仔細檢查,因爲我的數據是如此巨大,非常感謝你! – user3224522