改進代碼設計

這是關於一個更有效的代碼設計的一個問題：改進代碼設計

假設三個對準的DNA序列（SEQ1，SEQ2和SEQ3;它們各自的字符串），其表示兩個基因（基因1和基因2 ）。這些基因的起始和終止位置相對於比對的DNA序列是已知的。

# Input 
align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length 
     "seq2":"AT----GC", 
     "seq3":"A--CA--C"} 
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]}, 
     "seq2":{"gene1":[0,3], "gene2":[4,7]}, 
     "seq3":{"gene1":[0,3], "gene2":[4,7]}}

我希望刪除從對準的間隙（即，短劃線）和保持的開始的相對關聯和停止的基因的位置。

# Desired output 
align = {"seq1":"ATGCATGC", 
     "seq2":"ATGC", 
     "seq3":"ACAC"} 
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]}, 
     "seq2":{"gene1":[0,1], "gene2":[2,3]}, 
     "seq3":{"gene1":[0,1], "gene2":[2,3]}}

獲得所需的輸出不像看起來那麼微不足道。下面我寫了一些（行編號）僞代碼對於這個問題，但肯定有一個更優雅的設計。

1 measure length of any aligned gene # take any seq, since all seqs aligned 
2 list_lengths = list of gene lengths # order is important 
3 for seq in alignment 
4  outseq = "" 
5  for each num in range(0, length(seq)) # weird for-loop is intentional 
6   if seq[num] == "-" 
7    current_gene = gene whose start/stop positions include num 
8    subtract 1 from length of current_gene 
9    subtract 1 from lengths of all genes following current_gene in list_lengths 
10   else 
11    append seq[num] to outseq 
12  append outseq to new variable 
13  convert gene lengths into start/stop positions and append ordered to new variable

任何人都可以給我一個更短，更直接的代碼設計的建議/例子嗎？

來源

2016-01-15 Michael Gruenstaeudl

尋找一個蟒蛇解決方案，這更好地被標記_python_ - 我會下降僞代碼。您已將數組從「1」重新設置爲1：考慮代表範圍/切片，包括_ [from，to] _排除。 _anno_和_align_通過「標籤」的「關聯」看起來很輕微。 - 您需要指定_genes_之間允許的重疊或間隔（如果有）。保持_annos_中的_genes_應該有所幫助 - 指定！ 8＆9可能太詳細了。受過教育的猜測：取決於代表性，13大約是複雜性的一半 - 擴大。（一旦你得到了代碼_and_仍然可以看到問題，可以考慮在CODE REVIEW上提出這個問題。） – greybeard

@greybeard我按照你的建議改變了標籤。根據你的建議（特別是第13行），僞代碼的變化即將出現。 –

只是爲了澄清，這是你的數據是什麼意思？對於序列「AT ---- GC」，「gene1」：[0,3]，「gene2」：[4,7]表示gene1爲AT - ，可縮寫爲AT '，而gene2是' - GC'，可以縮寫爲'GC'？ – Pausbrak

此答案處理從註釋到cdlanes回答的更新annos字典。該答案離開annos字典，其索引爲[2,1]的錯誤代碼爲seq2gene2。如果序列包含該區域中的所有空位，我建議的解決方案將從字典中刪除gene條目。還要注意，如果一個基因在最後的align中只包含一個字母，那麼anno[geneX]將具有相同的起始和終止索引 - >請參見seq3gene1從您的評論annos。

align = {"seq1":"ATGCATGC", 
     "seq2":"AT----GC", 
     "seq3":"A--CA--C"} 

annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]}, 
     "seq2":{"gene1":[0,3], "gene2":[4,7]}, 
     "seq3":{"gene1":[0,3], "gene2":[4,7]}} 


annos3 = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
      "seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
      "seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}} 

import re 
for name,anno in annos.items(): 
    # indices of gaps removed usinig re 
    removed = [(m.start(0)) for m in re.finditer(r'-', align[name])] 

    # removes gaps from align dictionary 
    align[name] = re.sub(r'-', '', align[name]) 

    build_dna = '' 
    for gene,inds in anno.items(): 

     start_ind = len(build_dna)+1 

     #generator to sum the num '-' removed from gene 
     num_gaps = sum(1 for i in removed if i >= inds[0] and i <= inds[1]) 

     # build the de-gapped string 
     build_dna+= align[name][inds[0]:inds[1]+1].replace("-", "") 

     end_ind = len(build_dna) 

     if num_gaps == len(align[name][inds[0]:inds[1]+1]): #gene is all gaps 
      del annos[name][gene] #remove the gene entry 
      continue 
     #update the values in the annos dictionary 
     annos[name][gene][0] = start_ind-1 
     annos[name][gene][1] = end_ind-1

結果：從上面的3基因annos

In [3]: annos 
Out[3]: {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}, 
      'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 
      'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}

結果。只需更換annos變量：

In [5]: annos3 
Out[5]: {'seq1': {'gene1': [0, 2], 'gene2': [3, 4], 'gene3': [5, 7]}, 
      'seq2': {'gene1': [0, 1], 'gene3': [2, 3]}, 
      'seq3': {'gene1': [0, 0], 'gene2': [1, 2], 'gene3': [3, 3]}}

來源

2016-01-25 14:18:09 Kevin

_Looks_首屈一指。非常明智的評論。 – greybeard

@Kevin自動刪除僅包含間隙的註釋不是我開始考慮開始的一項功能但是，這種情況並不少見，好的代碼應該可以處理它，謝謝指出。 –

下面的示例程序的輸出相匹配的測試案例：

align = {"seq1":"ATGCATGC", 
     "seq2":"AT----GC", 
     "seq3":"A--CA--C"} 

annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]}, 
     "seq2":{"gene1":[0,3], "gene2":[4,7]}, 
     "seq3":{"gene1":[0,3], "gene2":[4,7]}} 

(START, STOP) = (0, 1) 

for alignment, sequence in align.items(): 
    new_sequence = '' 
    gap = 0 

    for position, codon in enumerate(sequence): 
     if '-' == codon: 
      for gene in annos[alignment].values(): 
       if gene[START] > (position - gap): 
        gene[START] -= 1 
       if gene[STOP] >= (position - gap): 
        gene[STOP] -= 1 
      gap += 1 
     else: 
      new_sequence += codon 

    align[alignment] = new_sequence

運行這個結果：

python3 -i test.py 
>>> align 
{'seq2': 'ATGC', 'seq1': 'ATGCATGC', 'seq3': 'ACAC'} 
>>> 
>>> annos 
{'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}, 'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}} 
>>>

我希望這仍然是優雅的，直接的，短期和Python的足夠！

來源

2016-01-20 20:34:39 cdlane

@cdlane你的解決方案顯然比我的複雜嘗試顯得更直接和Pythonic。如果您可以修復錯誤（這似乎是系統性的，請參閱下文），我很樂意選擇您的答案。至於系統性錯誤：請使用以下示例annos測試您的代碼，並在_seq2_和_seq3_中看到_gene3_的錯誤下限：'annos = {「seq1」：{「gene1」：[0,2]，「gene2」：基因3：[5,7]，「seq2」：{「gene1」：[0,2]，「gene2」：[3,4]，「gene3」 7]}，seq3「：{」gene1「：[0,2]，」gene2「：[3,4]，」gene3「：[5,7]}}' –

@MichaelGruenstaeudl我想我已經解決了錯誤和輸出匹配您的代碼生成。 – cdlane

我自己對上述問題的解決方案既不優雅也不Pythonic，但達到了預期的輸出。任何改進建議非常歡迎！

import collections 
import operator 
# measure length of any aligned gene # take any seq, since all seqs aligned 
align_len = len(align.itervalues().next()) 
# initialize output 
align_out, annos_out = {}, {} 
# loop through annos 
for seqname, anno in annos.items(): 
# operate on ordered sequence lengths instead on ranges 
    ordseqlens = collections.OrderedDict() 
# generate ordered sequence lengths 
    for k,v in sorted(anno.items(), key=operator.itemgetter(1)): 
     ordseqlens[k] = v[1]-v[0]+1 
# start (and later append to) sequence output 
    align_out[seqname] = "" 
# generate R-style for-loop 
    for pos in range(0, len(align[seqname])): 
     if align[seqname][pos] == "-": 
      try: 
       current_gene = next(key for key, a in anno.items() if a[0] <= pos <= a[1]) 
      except StopIteration: 
       print("No annotation provided for position", pos, "in sequence", seqname) 
# subtract 1 from lengths of current_gene 
      ordseqlens[current_gene] = ordseqlens[current_gene]-1 
# append nucleotide unless a gap 
     else: 
      align_out[seqname] += align[seqname][pos] 
# convert modified ordered sequence lengths back into start/stop positions 
    summ = 0 
    tmp_dict = {} 
    for k,v in ordseqlens.items(): 
     tmp_dict[k] = [summ, v+summ-1] 
     summ = v+summ 
# save start/stop positions to correct annos 
    annos_out[seqname] = tmp_dict

這段代碼的輸出是：

>>> align_out 
{'seq3': 'ACAC', 
'seq2': 'ATGC', 
'seq1': 'ATGCATGC'} 

>>> annos_out 
{'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}, 
'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 
'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}}

來源

2016-01-20 22:18:15

所以，我認爲，試圖打破每個序列成的基因，然後刪除破折號的方法產生了很多不必要的書-保持。相反，直接查看破折號可能更容易，然後根據它們的相對位置更新所有的索引。下面是我寫的一個功能，似乎正確操作：

from copy import copy 

def rewriteGenes(align, annos): 
    alignments = copy(align) 
    annotations = copy(annos) 

    for sequence, alignment in alignments.items(): 
     while alignment.find('-') > -1: 
      index = alignment.find('-') 
      for gene, (start, end) in annotations[sequence].items(): 
       if index < start: 
        annotations[sequence][gene][0] -= 1 
       if index <= end: 
        annotations[sequence][gene][1] -= 1 
      alignment = alignment[:index] + alignment[index+1:] 
     alignments[sequence] = alignment 

    return (alignments, annotations)

這遍歷每個對準破折號和刪除它們更新基因指標。

注意，這將產生與下面的測試案例指數[2,1]基因：

align = {"seq1":"ATGCATGC", 
     "seq2":"AT----GC", 
     "seq3":"A--CA--C"} 
annos = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
     "seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
     "seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}

這是必要因爲你的指數設置的方式不以其他方式允許空的基因。例如，索引[2,2]將是從索引2開始的長度爲1的序列。

來源

2016-01-21 18:06:51

改進代碼設計

回答

相關問題