2017-03-31 57 views
0

Python string splitting with multiple split points

OK, I'll get straight to the point. Here is my code:

import re

def digestfragmentwithenzyme(seqs, enzymes):

    fragment = []
    for seq in seqs:
        for enzyme in enzymes:
            results = []
            prog = re.compile(enzyme[0])
            for dingen in prog.finditer(seq):
                results.append(dingen.start() + enzyme[1])
            results.reverse()
            #result = 0
            for result in results:
                fragment.append(seq[result:])
                seq = seq[:result]
            fragment.append(seq[:result])
    fragment.reverse()
    return fragment

The input to this function is a list of strings (seqs), for example:

List = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"] 

and a list of enzymes:

[["TC", 1],["GC",1]] 

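(Note: more than one enzyme can be given, but in this problem they mostly consist of the letters A, T, C and G.)

The function should return a list that, in this example, contains 2 lists: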

Outputlist = [["AATT","CCGGT","CGGGG","CT","CGGGGG"],["AAAG","CAAAAT","CAAAAAAG","CAAAAAAT","C"]] 

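Right now I'm having trouble making it split twice and getting the correct output.

Some more information about the function: it looks through the string (seq) for the recognition site, in this case TC or GC, and splits it at the index given as the enzyme's second element. It should do this for both strings in the list and for both enzymes.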

+0

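It might help to spell out what exactly the "correct output" is. If your program doesn't do what you want, it won't help our readers understand what the relationship between the input sequences, the enzyme list and the output list really is. It's clearly not just a simple substring search. – Risadinha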

+0

For starters, `prog` is a regex that should operate on a single string, while `seq` is a list of strings, so `prog.finditer(seq)` is an error. You need to process one input string at a time. –

+0

@AlexHall Yes, I tried `for seq in seqs` (and changed it in the parameters as well), but it didn't give me the correct output –

Answers

1

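Assuming the idea is to split on each enzyme at the index point inside the enzyme's letters, the split essentially falls between two letters. No regex is needed.

You can do this by finding the occurrences and inserting a split marker at the right index, then post-processing the result to actually split.

For example: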

def digestfragmentwithenzyme(seqs, enzymes):
    # preprocess enzymes once, then apply to each sequence
    replacements = []
    for enzyme in enzymes:
        replacements.append((enzyme[0], enzyme[0][0:enzyme[1]] + '|' + enzyme[0][enzyme[1]:]))
    result = []
    for seq in seqs:
        for r in replacements:
            seq = seq.replace(r[0], r[1])  # So AATTC becomes AATT|C
        result.append(seq.split('|'))      # So AATT|C becomes AATT, C
    return result

def test():
    seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
    enzymes = [["TC", 1],["GC",1]]
    print(digestfragmentwithenzyme(seqs, enzymes))
+0

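No, an enzyme can be longer than 2 letters, and the index can be greater or less than 2. It can be anything from 0 to 5, and there is no minimum or maximum length for the letters –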

+0

So for enzyme ['AAT', 2], 'AATACCG' becomes 'AA', 'TACCG', but for ['AAT', 1] it's 'A', 'AATCCG'? – pbuck

+0

Yes, but ['AAT', 1] would become ['A', 'ATCCG'] –
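To make the index semantics from these comments concrete, here is a minimal sketch. `cut_once` is a hypothetical helper (not from the thread) that cuts only at the first occurrence of the site, which is enough to show where the index lands when the full string 'AATACCG' is sliced:

```python
def cut_once(seq, site, index):
    """Cut seq at the first occurrence of site, `index` letters into the site."""
    start = seq.find(site)
    if start == -1:
        return [seq]  # site not present: nothing to cut
    cut = start + index
    return [seq[:cut], seq[cut:]]

print(cut_once("AATACCG", "AAT", 2))  # ['AA', 'TACCG']
print(cut_once("AATACCG", "AAT", 1))  # ['A', 'ATACCG']
```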

1

Here is my solution:

Replace TC with T C and GC with G C (done according to the given index), then split on the space character.

def digest(seqs, enzymes):
    res = []
    for li in seqs:
        for en in enzymes:
            # insert a space at the cut position inside each recognition site
            li = li.replace(en[0], en[0][:en[1]] + " " + en[0][en[1]:])
        r = li.split()
        res.append(r)
    return res

seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1],["GC",1]]
#enzymes = [["AAT", 2],["GC",1]]
print(seqs)
print(digest(seqs, enzymes))

The results are:

With enzymes = [["TC", 1],["GC",1]]:

['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]

With enzymes = [["AAT", 2],["GC",1]]:

['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AA', 'TTCCGGTCGGGG', 'CTCGGGGG'], ['AAAG', 'CAAAA', 'TCAAAAAAG', 'CAAAAAA', 'TC']]
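One thing to keep in mind with the marker approach: each `replace` runs on the already-marked string, so a site that overlaps an earlier cut is no longer found. The small sketch below uses a made-up enzyme pair (not from the question) and a hypothetical single-sequence helper, `digest_with_markers`, to show the effect:

```python
def digest_with_markers(seq, enzymes):
    """Insert a space at each cut position, one enzyme at a time, then split."""
    for site, index in enzymes:
        seq = seq.replace(site, site[:index] + " " + site[index:])
    return seq.split()

seq = "ATCGA"
enzymes = [["TC", 1], ["TCG", 2]]  # hypothetical overlapping sites

# After the first replace the string is "AT CGA", so "TCG" no longer matches.
print(digest_with_markers(seq, enzymes))  # ['AT', 'CGA']
print("TCG" in seq)                       # True: the site is in the original string
```

Whether the second enzyme should still cut in such a case depends on the intended semantics; for the non-overlapping sites in the question the approach gives the expected output.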
0

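Here is something that should work using regex. In this solution I find all occurrences of your enzyme strings and split at their corresponding indices.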

import re

def digestfragmentwithenzyme(seqs, enzymes):
    out = []
    dic = dict(enzymes)  # dictionary of enzyme indices

    for seq in seqs:
        sub = []
        pos1 = 0

        enzstr = '|'.join(enz[0] for enz in enzymes)  # "TC|GC" in this case
        for match in re.finditer('(' + enzstr + ')', seq):
            index = dic[match.group(0)]
            pos2 = match.start() + index
            sub.append(seq[pos1:pos2])
            pos1 = pos2
        sub.append(seq[pos1:])
        out.append(sub)
    # [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
    return out
+0

I like yours, but is there any way to make it work with 1 enzyme instead of always needing 2 or more? Maybe with: if enzymes > 1: –

+0

@NathanWeesie As far as I can tell, it already works with 1 enzyme... why do you say the code needs 2 or more? –
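A quick way to check this: with a single enzyme the joined pattern is just that enzyme's site, so the same loop applies. The sketch below restates the approach from the answer in compact form; `digest` is a name chosen here for illustration, and the `re.escape` call is an addition in case a site ever contains regex metacharacters:

```python
import re

def digest(seqs, enzymes):
    index_of = dict((site, index) for site, index in enzymes)
    # With one enzyme this is simply "TC"; with two it would be "TC|GC".
    pattern = "|".join(re.escape(site) for site, _ in enzymes)
    out = []
    for seq in seqs:
        parts, pos = [], 0
        for match in re.finditer(pattern, seq):
            cut = match.start() + index_of[match.group(0)]
            parts.append(seq[pos:cut])
            pos = cut
        parts.append(seq[pos:])  # tail after the last cut
        out.append(parts)
    return out

print(digest(["AATTCCGGTCGGGGCTCGGGGG"], [["TC", 1]]))
# [['AATT', 'CCGGT', 'CGGGGCT', 'CGGGGG']]
```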

0

Using a regex search with positive lookbehind and lookahead:

import re


def digest_fragment_with_enzyme(sequences, enzymes):
    pattern = '|'.join('((?<={})(?={}))'.format(strs[:ind], strs[ind:]) for strs, ind in enzymes)
    print(pattern)  # prints ((?<=T)(?=C))|((?<=G)(?=C))
    for seq in sequences:
        # each zero-width match starts exactly at a cut position
        indices = [0] + [m.start() for m in re.finditer(pattern, seq)] + [len(seq)]
        yield [seq[start:end] for start, end in zip(indices, indices[1:])]

seq = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1], ["GC", 1]]
print(list(digest_fragment_with_enzyme(seq, enzymes)))

Output:

[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], 
['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']] 
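The lookbehind/lookahead pair matches the empty string between the two halves of a recognition site, so `m.start()` is already the cut position. Here is a minimal sketch for a longer site, assuming the index lies strictly inside the site so that both halves are non-empty; `cut_pattern` is a hypothetical helper name and the `re.escape` calls are an addition:

```python
import re

def cut_pattern(enzymes):
    """Build one zero-width pattern per enzyme and join them with '|'."""
    return "|".join(
        "(?<={})(?={})".format(re.escape(site[:index]), re.escape(site[index:]))
        for site, index in enzymes
    )

pattern = cut_pattern([["AAT", 2]])
print(pattern)  # (?<=AA)(?=T)

seq = "AATACCG"
cuts = [0] + [m.start() for m in re.finditer(pattern, seq)] + [len(seq)]
print([seq[a:b] for a, b in zip(cuts, cuts[1:])])  # ['AA', 'TACCG']
```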
0

The simplest answer I can think of:

input_list = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = ['TC', 'GC']
output = []
for string in input_list:
    parts = []
    left = 0
    for right in range(1, len(string)):
        if string[right-1:right+1] in enzymes:
            parts.append(string[left:right])
            left = right
    parts.append(string[left:])
    output.append(parts)
print(output)
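The loop above hard-codes two-letter sites that are cut after the first letter. Here is a sketch of the same scan generalised to `[site, index]` pairs, under the assumption that every enzyme cuts independently at each occurrence of its site in the original string (overlapping occurrences included); `digest_scan` is a hypothetical name:

```python
def digest_scan(seq, enzymes):
    """Collect every cut position first, then slice the sequence once."""
    cuts = set()
    for i in range(len(seq)):
        for site, index in enzymes:
            if seq.startswith(site, i):
                cuts.add(i + index)
    # Drop cuts at the very ends so no empty fragments are produced.
    bounds = [0] + sorted(c for c in cuts if 0 < c < len(seq)) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

print(digest_scan("AAAGCAAAATCAAAAAAGCAAAAAATC", [["AAT", 2], ["GC", 1]]))
# ['AAAG', 'CAAAA', 'TCAAAAAAG', 'CAAAAAA', 'TC']
```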
0

Throwing my hat in the ring here.

  • Use a dictionary instead of a list of lists.
  • Join the patterns, like others did, to avoid fancy regex.

import re

sequences = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
patterns = { 'TC': 1, 'GC': 1 }

def intervals(patterns, text):
    pattern = '|'.join(patterns.keys())
    start = 0
    for match in re.finditer(pattern, text):
        index = match.start() + patterns.get(match.group())
        yield text[start:index]
        start = index
    # yield the tail from `start`, so this also works when nothing matched
    yield text[start:]

print([list(intervals(patterns, s)) for s in sequences])

# [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]