2016-01-15 69 views
7

這是關於一個更有效的代碼設計的一個問題:改進代碼設計

假設三個對準的DNA序列(SEQ1,SEQ2和SEQ3;它們各自的字符串),其表示兩個基因(基因1和基因2 )。這些基因的起始和終止位置相對於比對的DNA序列是已知的。

# Input 
align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length 
     "seq2":"AT----GC", 
     "seq3":"A--CA--C"} 
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]}, 
     "seq2":{"gene1":[0,3], "gene2":[4,7]}, 
     "seq3":{"gene1":[0,3], "gene2":[4,7]}} 

我希望刪除從對準的間隙(即,短劃線)和保持的開始的相對關聯和停止的基因的位置。

# Desired output 
align = {"seq1":"ATGCATGC", 
     "seq2":"ATGC", 
     "seq3":"ACAC"} 
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]}, 
     "seq2":{"gene1":[0,1], "gene2":[2,3]}, 
     "seq3":{"gene1":[0,1], "gene2":[2,3]}} 

獲得所需的輸出不像看起來那麼微不足道。下面我寫了一些(行編號)僞代碼對於這個問題,但肯定有一個更優雅的設計。

1 measure length of any aligned gene # take any seq, since all seqs aligned 
2 list_lengths = list of gene lengths # order is important 
3 for seq in alignment 
4  outseq = "" 
5  for each num in range(0, length(seq)) # weird for-loop is intentional 
6   if seq[num] == "-" 
7    current_gene = gene whose start/stop positions include num 
8    subtract 1 from length of current_gene 
9    subtract 1 from lengths of all genes following current_gene in list_lengths 
10   else 
11    append seq[num] to outseq 
12  append outseq to new variable 
13  convert gene lengths into start/stop positions and append ordered to new variable 

任何人都可以給我一個更短,更直接的代碼設計的建議/例子嗎?

+0

尋找一個蟒蛇解決方案,這更好地被標記_python_ - 我會下降僞代碼。您已將數組從「1」重新設置爲1:考慮代表範圍/切片,包括_ [from,to] _排除。 _anno_和_align_通過「標籤」的「關聯」看起來很輕微。 - 您需要指定_genes_之間允許的重疊或間隔(如果有)。保持_annos_中的_genes_應該有所幫助 - 指定! 8&9可能太詳細了。受過教育的猜測:取決於代表性,13大約是複雜性的一半 - 擴大。 (一旦你得到了代碼_and_仍然可以看到問題,可以考慮在CODE REVIEW上提出這個問題。) – greybeard

+0

@greybeard我按照你的建議改變了標籤。根據你的建議(特別是第13行),僞代碼的變化即將出現。 –

+0

只是爲了澄清,這是你的數據是什麼意思?對於序列「AT ---- GC」,「gene1」:[0,3],「gene2」:[4,7]表示gene1爲AT - ,可縮寫爲AT ',而gene2是' - GC',可以縮寫爲'GC'? – Pausbrak

回答

2

此答案處理從註釋到cdlanes回答的更新annos字典。該答案離開annos字典,其索引爲[2,1]的錯誤代碼爲seq2gene2。如果序列包含該區域中的所有空位,我建議的解決方案將從字典中刪除gene條目。還要注意,如果一個基因在最後的align中只包含一個字母,那麼anno[geneX]將具有相同的起始和終止索引 - >請參見seq3gene1從您的評論annos

align = {"seq1":"ATGCATGC", 
     "seq2":"AT----GC", 
     "seq3":"A--CA--C"} 

annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]}, 
     "seq2":{"gene1":[0,3], "gene2":[4,7]}, 
     "seq3":{"gene1":[0,3], "gene2":[4,7]}} 


annos3 = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
      "seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
      "seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}} 

import re 
for name,anno in annos.items(): 
    # indices of gaps removed usinig re 
    removed = [(m.start(0)) for m in re.finditer(r'-', align[name])] 

    # removes gaps from align dictionary 
    align[name] = re.sub(r'-', '', align[name]) 

    build_dna = '' 
    for gene,inds in anno.items(): 

     start_ind = len(build_dna)+1 

     #generator to sum the num '-' removed from gene 
     num_gaps = sum(1 for i in removed if i >= inds[0] and i <= inds[1]) 

     # build the de-gapped string 
     build_dna+= align[name][inds[0]:inds[1]+1].replace("-", "") 

     end_ind = len(build_dna) 

     if num_gaps == len(align[name][inds[0]:inds[1]+1]): #gene is all gaps 
      del annos[name][gene] #remove the gene entry 
      continue 
     #update the values in the annos dictionary 
     annos[name][gene][0] = start_ind-1 
     annos[name][gene][1] = end_ind-1 

結果:從上面的3基因annos

In [3]: annos 
Out[3]: {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}, 
      'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 
      'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}} 

結果。只需更換annos變量:

In [5]: annos3 
Out[5]: {'seq1': {'gene1': [0, 2], 'gene2': [3, 4], 'gene3': [5, 7]}, 
      'seq2': {'gene1': [0, 1], 'gene3': [2, 3]}, 
      'seq3': {'gene1': [0, 0], 'gene2': [1, 2], 'gene3': [3, 3]}} 
+1

_Looks_首屈一指。非常明智的評論。 – greybeard

+0

@Kevin自動刪除僅包含間隙的註釋不是我開始考慮開始的一項功能但是,這種情況並不少見,好的代碼應該可以處理它,謝謝指出。 –

1

下面的示例程序的輸出相匹配的測試案例:

align = {"seq1":"ATGCATGC", 
     "seq2":"AT----GC", 
     "seq3":"A--CA--C"} 

annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]}, 
     "seq2":{"gene1":[0,3], "gene2":[4,7]}, 
     "seq3":{"gene1":[0,3], "gene2":[4,7]}} 

(START, STOP) = (0, 1) 

for alignment, sequence in align.items(): 
    new_sequence = '' 
    gap = 0 

    for position, codon in enumerate(sequence): 
     if '-' == codon: 
      for gene in annos[alignment].values(): 
       if gene[START] > (position - gap): 
        gene[START] -= 1 
       if gene[STOP] >= (position - gap): 
        gene[STOP] -= 1 
      gap += 1 
     else: 
      new_sequence += codon 

    align[alignment] = new_sequence 

運行這個結果:

python3 -i test.py 
>>> align 
{'seq2': 'ATGC', 'seq1': 'ATGCATGC', 'seq3': 'ACAC'} 
>>> 
>>> annos 
{'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}, 'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}} 
>>> 

我希望這仍然是優雅的,直接的,短期和Python的足夠!

+0

@cdlane你的解決方案顯然比我的複雜嘗試顯得更直接和Pythonic。如果您可以修復錯誤(這似乎是系統性的,請參閱下文),我很樂意選擇您的答案。至於系統性錯誤:請使用以下示例annos測試您的代碼,並在_seq2_和_seq3_中看到_gene3_的錯誤下限:'annos = {「seq1」:{「gene1」:[0,2],「gene2」 :基因3:[5,7],「seq2」:{「gene1」:[0,2],「gene2」:[3,4],「gene3」 7]},seq3「:{」gene1「:[0,2],」gene2「:[3,4],」gene3「:[5,7]}}' –

+0

@MichaelGruenstaeudl我想我已經解決了錯誤和輸出匹配您的代碼生成。 – cdlane

0

我自己對上述問題的解決方案既不優雅也不Pythonic,但達到了預期的輸出。任何改進建議非常歡迎!

import collections 
import operator 
# measure length of any aligned gene # take any seq, since all seqs aligned 
align_len = len(align.itervalues().next()) 
# initialize output 
align_out, annos_out = {}, {} 
# loop through annos 
for seqname, anno in annos.items(): 
# operate on ordered sequence lengths instead on ranges 
    ordseqlens = collections.OrderedDict() 
# generate ordered sequence lengths 
    for k,v in sorted(anno.items(), key=operator.itemgetter(1)): 
     ordseqlens[k] = v[1]-v[0]+1 
# start (and later append to) sequence output 
    align_out[seqname] = "" 
# generate R-style for-loop 
    for pos in range(0, len(align[seqname])): 
     if align[seqname][pos] == "-": 
      try: 
       current_gene = next(key for key, a in anno.items() if a[0] <= pos <= a[1]) 
      except StopIteration: 
       print("No annotation provided for position", pos, "in sequence", seqname) 
# subtract 1 from lengths of current_gene 
      ordseqlens[current_gene] = ordseqlens[current_gene]-1 
# append nucleotide unless a gap 
     else: 
      align_out[seqname] += align[seqname][pos] 
# convert modified ordered sequence lengths back into start/stop positions 
    summ = 0 
    tmp_dict = {} 
    for k,v in ordseqlens.items(): 
     tmp_dict[k] = [summ, v+summ-1] 
     summ = v+summ 
# save start/stop positions to correct annos 
    annos_out[seqname] = tmp_dict 

這段代碼的輸出是:

>>> align_out 
{'seq3': 'ACAC', 
'seq2': 'ATGC', 
'seq1': 'ATGCATGC'} 

>>> annos_out 
{'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}, 
'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 
'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}} 
0

所以,我認爲,試圖打破每個序列成的基因,然後刪除破折號的方法產生了很多不必要的書-保持。相反,直接查看破折號可能更容易,然後根據它們的相對位置更新所有的索引。下面是我寫的一個功能,似乎正確操作:

from copy import copy 

def rewriteGenes(align, annos): 
    alignments = copy(align) 
    annotations = copy(annos) 

    for sequence, alignment in alignments.items(): 
     while alignment.find('-') > -1: 
      index = alignment.find('-') 
      for gene, (start, end) in annotations[sequence].items(): 
       if index < start: 
        annotations[sequence][gene][0] -= 1 
       if index <= end: 
        annotations[sequence][gene][1] -= 1 
      alignment = alignment[:index] + alignment[index+1:] 
     alignments[sequence] = alignment 

    return (alignments, annotations) 

這遍歷每個對準破折號和刪除它們更新基因指標。

注意,這將產生與下面的測試案例指數[2,1]基因:

align = {"seq1":"ATGCATGC", 
     "seq2":"AT----GC", 
     "seq3":"A--CA--C"} 
annos = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
     "seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
     "seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}} 

這是必要因爲你的指數設置的方式不以其他方式允許空的基因。例如,索引[2,2]將是從索引2開始的長度爲1的序列。