2014-06-19 173 views
3

我試圖將一系列基因組座標合併爲連續範圍,並有一個用於合併間隙的附加選項。將重疊的數字範圍合併爲連續的範圍

例如,如果我有基因組範圍[[0, 1000], [5, 1100]]我想要的結果是[0, 1100]。如果偏移選項設置爲100,並且輸入爲[[0, 1000], [1090, 1000]],我會再次希望結果爲[0, 1100]

我已經實現了這樣一種方式,它按順序逐步對齊並嘗試合併上一個結束位置和下一個起始位置,但由於實際結果具有不同的長度而失敗。例如,我的結果[[138, 821],[177, 1158], [224, 905], [401, 1169]]在我的列表中按起始位置排序。答案應該是[138, 1169],但我反而得到[[138, 1158], [177, 905], [224, 1169]]。顯然,我需要考慮的不僅僅是前一個結局和下一個開始,但我還沒有找到一個好的解決方案(最好不是一個if語句的巨大嵌套)。任何人有任何建議?

def overlap_alignments(align, gene, overlap): 
    #make sure alignments are sorted first by chromosome then by start pos on chrom 
    align = sorted(align, key = lambda x: (x[0], x[1])) 
    merged = list() 
    for i in xrange(1, len(align)): 
     prv, nxt = align[i-1], align[i] 
     if prv[0] == nxt[0] and prv[2] + overlap >= nxt[1]: 
      start, end = prv[1], nxt[2] 
      chrom = prv[0] 
      merged.append([chrom, start, end, gene]) 
    return merged 

回答

4

那麼,如何跟蹤的每一個開始和結束和範圍的數,其中每個位置屬於?

def overlap_alignments(align, overlap): 
    # create a list of starts and ends 
    stends = [ (a[0], 1) for a in align ] 
    stends += [ (a[1] + overlap, -1) for a in align ] 
    stends.sort(key=lambda x: x[0]) 

    # now we should have a list of starts and ends ordered by position, 
    # e.g. if the ranges are 5..10, 8..15, and 12..13, we have 
    # (5,1), (8,1), (10,-1), (12,1), (13,-1), (15,-1) 

    # next, we form a cumulative sum of this 
    s = 0 
    cs = [] 
    for se in stends: 
     s += se[1] 
     cs.append((se[0], s)) 
    # this is, with the numbers above, (5,1), (8,2), (10,1), (12,2), (13,1), (15,0) 
    # so, 5..8 belongs to one range, 8..10 belongs to two overlapping range, 
    # 10..12 belongs to one range, etc 

    # now we'll find all contiguous ranges 
    # when we traverse through the list of depths (number of overlapping ranges), a new 
    # range starts when the earlier number of overlapping ranges has been 0 
    # a range ends when the new number of overlapping ranges is zero 
    prevdepth = 0 
    start = 0 
    combined = [] 
    for pos, depth in cs: 
     if prevdepth == 0: 
      start = pos 
     elif depth == 0 
      combined.append((start, pos-overlap)) 
     prevdepth = depth 

    return combined 

這會比繪製更容易解釋。 (是的,累積和可以寫在更短的空間,但我發現這樣更清晰。)

爲了解釋這個圖形,讓我們把輸入([5,10],[8,15],[ 12,13],[16,20])並且重疊= 1。

.....XXXXXo.............. (5-10) 
........XXXXXXXo......... (8-15) 
............Xo........... (12-13) 
................XXXXo.... (16-20) 
.....1112221221111111.... number of ranges at each position 
.....----------------.... number of ranges > 0 
.....---------------..... overlap corrected (5-20) 
3

Python帶有batteries included

from itertools import chain 

flatten = chain.from_iterable 

LEFT, RIGHT = 1, -1 

def join_ranges(data, offset=0): 
    data = sorted(flatten(((start, LEFT), (stop + offset, RIGHT)) 
      for start, stop in data)) 
    c = 0 
    for value, label in data: 
     if c == 0: 
      x = value 
     c += label 
     if c == 0: 
      yield x, value - offset 

if __name__ == '__main__': 
    print list(join_ranges([[138, 821], [900, 910], [905, 915]])) 
    print list(join_ranges([[138, 821], [900, 910], [905, 915]], 80)) 

結果:

[(138, 821), (900, 915)] 
[(138, 915)] 

它是如何工作的:我們的標籤每次開始和結束點作爲這樣的,那麼我們排序,之後我們簡單地計算每個起始點爲,每個終點爲下降。如果我們訪問了相同數量的起點和終點,則我們有一個關閉(加入)的範圍。