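Doing this with a single regular expression is actually quite difficult, because most use cases don't want overlapping matches. You can, however, get there with some simple iteration: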
import re

regex = re.compile(r'(?=AUG)(\w+)(?=UAG|UGA|UAA)')
RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
matches = []
tmp = RNA
while True:
    match = regex.search(tmp)
    if match is None:
        break
    matches.append(match)
    # Resume one base past this AUG so later, overlapping start codons are still found.
    tmp = tmp[match.start() + 1:]
for m in matches:
    print(m.group(0))
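There are some problems with this, though. What would you expect back for `AUGUAGUGAUAA`? Two strings, or only one? As it stands, your regex won't even catch the match that ends at the first `UAG`, because it keeps matching through `UAGUGA` and cuts off at `UAA`. To fix that you could make the quantifier lazy with the `?` operator, but then that approach would fail to capture the longer substrings.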
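To make the greedy-versus-lazy trade-off concrete, here is a minimal sketch that runs both variants of the pattern on `AUGUAGUGAUAA`. The variable names `greedy` and `lazy` are only illustrative.

```python
import re

seq = 'AUGUAGUGAUAA'

# Greedy: \w+ grabs as much as it can, so the match runs up to the last stop codon.
greedy = re.search(r'(?=AUG)(\w+)(?=UAG|UGA|UAA)', seq).group(0)

# Lazy: \w+? stops at the first position that is followed by a stop codon.
lazy = re.search(r'(?=AUG)(\w+?)(?=UAG|UGA|UAA)', seq).group(0)

print(greedy)  # AUGUAGUGA
print(lazy)    # AUG
```

Neither variant returns both the short and the long candidate in one pass, which is why some form of iteration is still needed.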
Perhaps iterating over the string twice is the answer, but what if your RNA sequence contains `AUGAUGUAGUGAUAA`? What is the correct behavior there?
I would probably favor a regex-free approach, iterating over the string and its substrings:
RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
candidates = []
start = 0
# Python before 3.8 has no assignment expressions, hence the repeated find().
while RNA.find('AUG', start) > -1:
    start = RNA.find('AUG', start)
    candidates.append(RNA[start + 3:])  # everything after this start codon
    start += 1
matches = []
for candidate in candidates:
    for terminator in ['UAG', 'UGA', 'UAA']:
        end = 1
        while candidate.find(terminator, end) > -1:
            end = candidate.find(terminator, end)
            matches.append(candidate[:end])  # text between AUG and this stop codon
            end += 1
for match in matches:
    print(match)
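This way you are sure to get all of the matches, no matter what.
If you need to keep track of where each match sits, you can change the candidates data structure to hold tuples that carry the starting position: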
RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
candidates = []
start = 0
# Python before 3.8 has no assignment expressions, hence the repeated find().
while RNA.find('AUG', start) > -1:
    start = RNA.find('AUG', start)
    candidates.append((RNA[start + 3:], start + 3))  # (text after AUG, its offset in RNA)
    start += 1
matches = []
for candidate in candidates:
    for terminator in ['UAG', 'UGA', 'UAA']:
        end = 1
        while candidate[0].find(terminator, end) > -1:
            end = candidate[0].find(terminator, end)
            # (absolute start, absolute end, sequence in between)
            matches.append((candidate[1], candidate[1] + end, candidate[0][:end]))
            end += 1
for match in matches:
    print("%d - %d: %s" % match)
It prints:
7 - 49: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
7 - 85: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
7 - 31: UAGCUAACUCAGGUUACAUGGGGA
7 - 72: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
7 - 76: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
7 - 11: UAGC
7 - 66: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
27 - 49: GGGAUGACCCCGCGACUUGGAU
27 - 85: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
27 - 31: GGGA
27 - 72: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
27 - 76: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
27 - 66: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
33 - 49: ACCCCGCGACUUGGAU
33 - 85: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
33 - 72: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
33 - 76: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
33 - 66: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
78 - 85: AUCCGAG
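As a cross-check, the same (start, end, sequence) triples can be collected with zero-width lookaheads, which avoid the overlap problem because they consume no characters. This is only a sketch: the function name `overlapping_candidates` is hypothetical, and the `e > s` filter is an assumption chosen to mirror the `end = 1` starting point in the loop above.

```python
import re

RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'

def overlapping_candidates(rna):
    """Return (start, end, sequence) for every start/stop codon pairing."""
    # Zero-width lookaheads consume nothing, so overlapping codons are all reported.
    starts = [m.start() + 3 for m in re.finditer(r'(?=AUG)', rna)]
    stops = [m.start() for m in re.finditer(r'(?=UAG|UGA|UAA)', rna)]
    # Keep only stop codons that begin after the first base following AUG.
    return [(s, e, rna[s:e]) for s in starts for e in stops if e > s]

for start, end, seq in overlapping_candidates(RNA):
    print("%d - %d: %s" % (start, end, seq))
```

Because both position lists come out in ascending order, the triples are already ordered by start and then by end.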
Heck, add three more lines and you can even sort the matches by where they fall in the RNA sequence:
from operator import itemgetter
matches.sort(key=itemgetter(1))
matches.sort(key=itemgetter(0))
The final printing (here with zero-padded positions) then nets you:
007 - 011: UAGC
007 - 031: UAGCUAACUCAGGUUACAUGGGGA
007 - 049: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
007 - 066: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
007 - 072: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
007 - 076: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
007 - 085: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
027 - 031: GGGA
027 - 049: GGGAUGACCCCGCGACUUGGAU
027 - 066: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
027 - 072: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
027 - 076: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
027 - 085: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
033 - 049: ACCCCGCGACUUGGAU
033 - 066: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
033 - 072: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
033 - 076: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
033 - 085: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
078 - 085: AUCCGAG
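The two `sort` calls work because Python's sort is stable: sorting by end first and then by start leaves entries with equal starts in end order. A single sort on a composite key should give the same ordering. The sketch below uses a hand-picked subset of the matches above purely for illustration, and the `%03d` format is an assumption about how the zero-padded listing was produced.

```python
from operator import itemgetter

# A few (start, end, sequence) tuples from the output above, deliberately out of order.
sample = [
    (27, 49, 'GGGAUGACCCCGCGACUUGGAU'),
    (7, 31, 'UAGCUAACUCAGGUUACAUGGGGA'),
    (27, 31, 'GGGA'),
    (7, 11, 'UAGC'),
]

# Two stable passes: secondary key (end) first, then primary key (start).
two_pass = list(sample)
two_pass.sort(key=itemgetter(1))
two_pass.sort(key=itemgetter(0))

# One pass: itemgetter(0, 1) returns a (start, end) tuple to sort on.
one_pass = sorted(sample, key=itemgetter(0, 1))

for match in one_pass:
    print("%03d - %03d: %s" % match)
```

Both approaches order the sample identically; the single-key form just states the intent in one line.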
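I answered a similar question before: it can't be done with Python regular expressions, afaik. In Perl you can get all possible matches with a few tricks. – Qtax 2013-04-03 22:29:58
There is a [new regex module for Python](https://pypi.python.org/pypi/regex) that allows overlapping matches. – ovgolovin 2013-04-03 22:34:31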