2017-06-03 99 views
-1

Regular expression for multiple blocks of text (Python)

I have the following file named seq.fasta:

>AAM15934.1| NtrX [Gluconacetobacter diazotrophicus]| NTRX1 | Response_reg - Sigma54_activat - HTH_8 
MGHEILIVDDEPDIRLLVEGILRDEGYETRLAGDSDSAISAFRARRPSLVILDVWLQGSRLDGLGILQAI 
QGEEPVVPTIMISGHGTIETAVAALQHGAYDFIEKPFQSDRLLLVVRRALEASRLARENAELRLRAGPEA 
MLYGDSPVIAGVRNQIERVAPSGSRVLISGAAGAGKEVAARMIHARSPGPKAFIALNCATLAPGRFEEEL 
FGIEGAPDGTGRRTGVLERAHGGTLLLDEVSDMPIETQGKIVRALQDQSFERVGGASRVKVDVRVLAATN 
RDLQEAIAAGRFREDLYYRLAVVPLRVPSLRERREDIPGLARLFLRRAAENAGLPLRDLSGDAVAALQSY 
DWPGNARELRNLMERLLIMMPGNGSDLIRAEMLPPSVGQGAPALLKFDPAADVMGLPLREARDLFETQYL 
QAQLLRFGGNISRTAGFVGMERSALHRKLKQLGVTSEERGAG 

>WP_002731145.1| NtrX [Phaeospirillum molischianum]| NTRX1 | Response_reg - Sigma54_activat - HTH_8 
MAHDILIVDDEADIRVLIAGILEDEGHSTREAANADEALERIRARRPSLVIQDIWLQGSRLDGLGVLDEI 
KREHPDVPVVMISGHGTIETAVQAIKQGAYDFIEKPFKADRLLLVVDRAIESARLKRENQELRVRSGSTG 
DLVGISPALVQIRQTIERVAPTNSRVLITGPAGSGKEVAARMIHAHSRRTEGPFVVVNCAAMHPDRMEIE 
LFGTEYGADGSTSPRKIGTFEQAHSGTLLLDEVADMPLETQGKIVRVLQDQTFERVGGGKRVEVDVRVIA 
TTNRDLQSEMIAGHFREDLFYRLNVVPIRMPALRDGKEDIPLLARQFMQLAAQLAGVPPRPLGEDALAAL 
QAYDWPGNVRQLRNAIDWLLIMAPGDWRDPVRADMLPSEIGAITPAVLRWEKSSEIMTLPLREARELFER 
EYLLAQVNRFAGNISRTAAFVGMERSALHRKLKLLGINTDEKVR 

>WP_002967695.1| NtrX [Brucella abortus]| NTRX1 | Response_reg - Sigma54_activat - HTH_8 
MAADILVVDDEVDIRDLVAGILSDEGHETRTAFDADSALAAINDRAPRLVFLDIWLQGSRLDGLALLDEI 
KKQHPELPVVMISGHGNIETAVSAIRRGAYDFIEKPFKADRLILVAERALETSKLKREVSDLRKRTGDQL 
ELVGTSLAMNQLRQTIERVAPTNSRIMITGPSGAGKELVARTIHAQSSRANGPFVTVNAATITPERMEIE 
LFGTEMDGGERKVGALEEAHGGILYLDEVADMPRETQNKILRVLVDQQFERVGGTKRVKVDVRIISSTAQ 
NLEGMIAEGTFREDLFHRLSVVPVQVPALAARREDIPSLVEFFMKQIAEQAGIKPRKIGPDAMAVLQAHS 
WPGNLRQLRNNVERLMILTRGDDPDELVTADLLPAEIGDTLPRAPTESDQHIMALPLREARERFEKEYLI 
AQINRFGGNISRTAEFVGMERSALHRKLKSLGV 

I want to put each block of letters into a list. For example, the list contents would be:

List[0] = MGHEILIVDDEPDIRLLVEGILRDEGYETRLAGDSDSAISAFRARRPSLVILDVWLQGSRLDGLGILQAI 
QGEEPVVPTIMISGHGTIETAVAALQHGAYDFIEKPFQSDRLLLVVRRALEASRLARENAELRLRAGPEA 
MLYGDSPVIAGVRNQIERVAPSGSRVLISGAAGAGKEVAARMIHARSPGPKAFIALNCATLAPGRFEEEL 
FGIEGAPDGTGRRTGVLERAHGGTLLLDEVSDMPIETQGKIVRALQDQSFERVGGASRVKVDVRVLAATN 
RDLQEAIAAGRFREDLYYRLAVVPLRVPSLRERREDIPGLARLFLRRAAENAGLPLRDLSGDAVAALQSY 
DWPGNARELRNLMERLLIMMPGNGSDLIRAEMLPPSVGQGAPALLKFDPAADVMGLPLREARDLFETQYL 
QAQLLRFGGNISRTAGFVGMERSALHRKLKQLGVTSEERGAG 

List[1] = MAHDILIVDDEADIRVLIAGILEDEGHSTREAANADEALERIRARRPSLVIQDIWLQGSRLDGLGVLDEI 
KREHPDVPVVMISGHGTIETAVQAIKQGAYDFIEKPFKADRLLLVVDRAIESARLKRENQELRVRSGSTG 
DLVGISPALVQIRQTIERVAPTNSRVLITGPAGSGKEVAARMIHAHSRRTEGPFVVVNCAAMHPDRMEIE 
LFGTEYGADGSTSPRKIGTFEQAHSGTLLLDEVADMPLETQGKIVRVLQDQTFERVGGGKRVEVDVRVIA 
TTNRDLQSEMIAGHFREDLFYRLNVVPIRMPALRDGKEDIPLLARQFMQLAAQLAGVPPRPLGEDALAAL 
QAYDWPGNVRQLRNAIDWLLIMAPGDWRDPVRADMLPSEIGAITPAVLRWEKSSEIMTLPLREARELFER 
EYLLAQVNRFAGNISRTAAFVGMERSALHRKLKLLGINTDEKVR 

List[2] = MAADILVVDDEVDIRDLVAGILSDEGHETRTAFDADSALAAINDRAPRLVFLDIWLQGSRLDGLALLDEI 
KKQHPELPVVMISGHGNIETAVSAIRRGAYDFIEKPFKADRLILVAERALETSKLKREVSDLRKRTGDQL 
ELVGTSLAMNQLRQTIERVAPTNSRIMITGPSGAGKELVARTIHAQSSRANGPFVTVNAATITPERMEIE 
LFGTEMDGGERKVGALEEAHGGILYLDEVADMPRETQNKILRVLVDQQFERVGGTKRVKVDVRIISSTAQ 
NLEGMIAEGTFREDLFHRLSVVPVQVPALAARREDIPSLVEFFMKQIAEQAGIKPRKIGPDAMAVLQAHS 
WPGNLRQLRNNVERLMILTRGDDPDELVTADLLPAEIGDTLPRAPTESDQHIMALPLREARERFEKEYLI 
AQINRFGGNISRTAEFVGMERSALHRKLKSLGV 

But I am struggling to split the blocks and put them into a list. My code is this:

import re 

myfile = open('seq.fasta', 'r').read() 

regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE) 
matches = [m.groups() for m in regex.finditer(myfile)] 

for m in matches: 
    onlySequences = (m[1]) 

print(onlySequences) 

The variable onlySequences holds only the last block of letters. How can I keep all of them, each as a separate item in a list?

+1

You are overwriting 'onlySequences' on every iteration over 'matches'.

+0

You don't need a regex to do this.

Answers

0

You are overwriting onlySequences on each pass of the for loop, so only the last match survives. Perhaps you only need this:

matches = [m.groups()[1] for m in regex.finditer(myfile)] 
print(matches) 

Or, as a minimal correction to your code:

matches = [m.groups() for m in regex.finditer(myfile)] 
onlySequences = [m[1] for m in matches] 
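
The captured blocks still contain the line breaks from the file. A minimal sketch of the same idea that also joins each block into one string follows. The sample text and the name only_sequences are made up for illustration; the pattern is the one from the question.

```python
import re

# Hypothetical two-record sample standing in for the contents of seq.fasta.
sample = (
    ">seq1| example header\n"
    "MGHEIL\n"
    "QGEEPV\n"
    "\n"
    ">seq2| example header\n"
    "MAHDIL\n"
    "KREHPD\n"
)

# Same pattern as in the question: group 1 is the header, group 2 the block.
regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)

# Build the whole list in one pass instead of reassigning a single variable.
only_sequences = [
    "".join(m.group(2).split())  # drop the line breaks inside each block
    for m in regex.finditer(sample)
]

print(only_sequences)
```

To run it on the real file, replace sample with the text read from seq.fasta. Note that the pattern only accepts uppercase letters and line breaks in a block, so a block would be cut short at the first other character.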
0

You don't need a regular expression for this. A better approach is to read the file line by line:

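with open('seq.fasta', 'r') as fh:
    result = []
    temp = ''  # lines of the record currently being read
    for line in fh:
        if line.startswith('>'):
            # Header line: store the previous record, if any, and start a new one.
            if temp:
                result.append(temp)
            temp = ''
        elif line.strip():
            # Sequence line: keep it (blank lines are skipped).
            temp = temp + line

    # Store the last record, which has no header after it.
    if temp:
        result.append(temp)

    print("\n".join(result))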
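
If the headers are needed as well, the same line-by-line idea can be wrapped in a generator. This is only a sketch: read_fasta and the sample data are hypothetical names and values chosen for illustration, and each sequence is joined into a single string without line breaks.

```python
import io

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            # Non-blank line inside a record: part of the sequence.
            chunks.append(line)
    # Emit the final record, which is not followed by another header.
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical sample; with a real file, pass the open file object instead.
sample = io.StringIO(
    ">seq1| example header\nMGHEIL\nQGEEPV\n\n"
    ">seq2| example header\nMAHDIL\nKREHPD\n"
)

records = list(read_fasta(sample))
sequences = [seq for _, seq in records]
print(sequences)
```

Because the function accepts any iterable of lines, it could be called as read_fasta(fh) inside a with open('seq.fasta') block, and it does not depend on blank lines separating the records.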