2013-03-19 23 views

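Python: reading a text file in blocks when the size of each block is unknown

I have to work with a huge text file whose data is organized in blocks separated by blank lines. It looks like this: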

>3D_helix;140 
protein_name:AChR pore alpha subunit (Torpedo marmorata) 
file_name:ACh_pore_alpha.txt 
entry_date:3july03 
refman_number:21022 
endnote_number: 
author:Miyazawa,A., Fujiyoshi,Y., Unwin,N.(2003) [Structure and gating mechanism of the acetylcholine receptor pore] {Nature, 423, 949-955} 
remarks:Sequence is from PDB, chain A. There is additional 24 AA as signal sequence in Swiss-Prot. TMhelices=4. 
pir_number: 
Swiss_Prot_entry:ACHA_TORMA 
Swiss_Prot_number:P02711 
Swiss_Prot_gene:CHRNA1 
Swiss_Prot_name:Acetylcholine receptor subunit alpha 
PDB_title:Acetylcholine Receptor Protein, alpha Chain 
PDB_Identifier:1OED 
N_terminal:in 
number_tmsegs:4 
tm_segments:A.211,237;B.243,271;C.275,300;D.403,436 
sequence:SEHETRLVANLLENYNKVIRPVEHHTHFVDITVGLQLIQLINVDEVNQIVETNVRLRQQWIDVRLRWNPADYGGIKKIRLPSDDVWLPDLVLYNNADGDFAIVHMTKLLLDYTGKIMWTPPAIFKSYCEIIVTHFPFDQQNCTMKLGIWTYDGTKVSISPESDRPDLSTFMESGEWVMKDYRGWKHWVYYTCCPDTPYLDITYHFIMQRIPLYFVVNVIIPCLLFSFLTVLVFYLPTDSGEKMTLSISVLLSLTVFLLVIVELIPSTSSAVPLIGKYMLFTMIFVISSIIVTVVVINTHHRSPSTHTMPQWVRKIFINTIPNVMFFSTMKRASKEKQENKIFADDIDISDISGKQVTGEVIFQTPLIKNPDVKSAIEGVKYIAEHMKSDEESSNAAEEWKYVAMVIDHILLCVFMLICIIGTVSVFAGRLIELSQEG* 

>1D_helix;141 
protein_name:AChR pore beta subunit (Torpedo marmorata) 
file_name:ACh_pore_beta.txt 
entry_date:3july03 
refman_number:21022 
endnote_number: 
author:Miyazawa,A., Fujiyoshi,Y., Unwin,N.(2003) [Structure and gating mechanism of the acetylcholine receptor pore] {Nature, 423, 949-955} 
remarks:Sequence is from PDB, chain B. There is additional 24 AA as signal sequence in Swiss-Prot. TMhelices=4. 
pir_number: 
Swiss_Prot_entry:Q6S3I0_TORMA 
Swiss_Prot_number:Q6S3I0 
Swiss_Prot_gene:none 
Swiss_Prot_name:Acetylcholine receptor beta subunit 
PDB_title:Acetylcholine Receptor Protein, beta Chain 
PDB_Identifier:1OED 
N_terminal:in 
number_tmsegs:4 
tm_segments:A.224,241;B.249,274;C.290,306;D.438,462 
sequence:SVMEDTLLSVLFENYNPKVRPSQTVGDKVTVRVGLTLTSLLILNEKNEEMTTSVFLNLAWTDYRLQWDPAAYEGIKDLSIPSDDVWQPDIVLMNNNDGSFEITLHVNVLVQHTGAVSWHPSAIYRSSCTIKVMYFPFDWQNCTMVFKSYTYDTSEVILQHALDAKGEREVKEIMINQDAFTENGQWSIEHKPSRKNWRSDDPSYEDVTFYLIIQRKPLFYIVYTIVPCILISILAILVFYLPPDAGEKMSLSISALLALTVFLLLLADKVPETSLSVPIIISYLMFIMILVAFSVILSVVVLNLHHRSPNTHTMPNWIRQIFIETLPPFLWIQRPVTTPSPDSKPTIISRANDEYFIRKPAGDFVCPVDNARVAVQPERLFSEMKWHLNGLTQPVTLPQDLKEAVEAIKYIAEQLESASEFDDLKKDWQYVAMVADRLFLYIFITMCSIGTFSIFLDASHNVPPDNPFA* 

>3D_other;143 
protein_name:AChR pore delta subunit (Torpedo marmorata) 
file_name:ACh_pore_delta.txt 
entry_date:4dec03 
refman_number:21022 
endnote_number: 
author:Miyazawa,A., Fujiyoshi,Y., Unwin,N.(2003) [Structure and gating mechanism of the acetylcholine receptor pore] {Nature, 423, 949-955} 
remarks:Sequence is from PDB, chain C. Sequence in PDB has first 21 AA removed relative to Swiss-Prot. TMhelices=4. 
pir_number: 
Swiss_Prot_entry:Q6S3H8_TORMA 
Swiss_Prot_number:Q6S3H8 
Swiss_Prot_gene:none 
Swiss_Prot_name:Acetylcholine receptor delta subunit 
PDB_title:Acetylcholine Receptor Protein, delta Chain 
PDB_Identifier:1OED 
N_terminal:in 
number_tmsegs:4 
tm_segments:A.226,253;B.257,285;C.289,316;D.452,483 
sequence:VNEEERLINDLLIVNKYNKHVRPVKHNNEVVNIALSLTLSNLISLKETDETLTTNVWMDHAWYDHRLTWNASEYSDISILRLRPELIWIPDIVLQNNNDGQYNVAYFCNVLVRPNGYVTWLPPAIFRSSCPINVLYFPFDWQNCSLKFTALNYNANEISMDLMTDTIDGKDYPIEWIIIDPEAFTENGEWEIIHKPAKKNIYGDKFPNGTNYQDVTFYLIIRRKPLFYVINFITPCVLISFLAALAFYLPAESGEKMSTAICVLLAQAVFLLLTSQRLPETALAVPLIGKYLMFIMSLVTGVVVNCGIVLNFHFRTPSTHVLSTRVKQIFLEKLPRILHMSRVDEIEQPDWQNDLKLRRSSSVGYISKAQEYFNIKSRSELMFEKQSERHGLVPRVTPRIGFGNNNENIAASDQLHDEIKSGIDSTNYIVKQIKEKNAYDEEVGNWNLVGQTIDRLSMFIITPVMVLGTIFIFVMGNFNRPPAKPFEGDPFDYSSDHPRCA 

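Each block starts with one of three given options, and the number of lines per block varies. I want to split the file into 3 parts (or 3 separate files) such that: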

part 1 contains all blocks starting with >3D_helix
part 2 contains all blocks starting with >1D_helix
part 3 contains all blocks starting with >3D_other

I tried the following approach:

prot_file = open(sys.argv[1], "r")
flag = False
for line in prot_file:
    if line.startswith(">3D_other"):
        flag == True
    if flag == True:
        print line

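But it only prints the first line, i.e. the 3D_helix one. Most of the hints I found online split a list into chunks based on a known chunk size (that is, the size is fixed at some specific number, e.g. 13). In my case the size is unknown, so I cannot use them. I would like an efficient, Pythonic way to split the file as described.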

Answer

Here is the solution I came up with:

#!/usr/bin/env python 

INPUT_FILE = 'input.txt' 
OUT_3D_HELIX = 'out_3dhelix.txt' 
OUT_1D_HELIX = 'out_1dhelix.txt' 
OUT_3D_OTHER = 'out_3dother.txt' 

f_input = open(INPUT_FILE, 'r') 
out_3dhelix = open(OUT_3D_HELIX, 'w') 
out_1dhelix = open(OUT_1D_HELIX, 'w') 
out_3dother = open(OUT_3D_OTHER, 'w') 

dest_file = None 
starting = True 

try:
    for line in f_input:
        if starting:
            # We are at a block start: pick the output file from the header
            if line.startswith('>3D_helix;'):
                dest_file = out_3dhelix
            elif line.startswith('>1D_helix;'):
                dest_file = out_1dhelix
            elif line.startswith('>3D_other;'):
                dest_file = out_3dother
            else:
                continue  # Invalid line -- not a block beginning
            starting = False

        if not line.strip():  # Line is blank -- block end
            starting = True
            dest_file = None
            continue

        if dest_file is not None:  # Should always be set at this point
            dest_file.write(line)

finally: 
    ## Close files... 
    f_input.close() 
    out_3dhelix.close() 
    out_1dhelix.close() 
    out_3dother.close() 

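Basically, it reads the file line by line and detects the "block starter" lines, which determine the destination file that the following lines are written to.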
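
As a side note, the attempt in the question never switches its flag on, because `flag == True` is a comparison rather than an assignment (`flag = True`). The same line-by-line idea can also be sketched more compactly with a generator that collects one block at a time. This is only a sketch under a few assumptions: blocks are separated by blank (or whitespace-only) lines, the names `iter_blocks` and `split_blocks` are made up for illustration, and the output file names in the usage comment are the ones from the script above.

```python
def iter_blocks(lines):
    """Yield each block (a list of non-blank lines) from an iterable of lines."""
    block = []
    for line in lines:
        if line.strip():
            block.append(line)
        elif block:
            # A blank line closes the current block.
            yield block
            block = []
    if block:
        # The last block may not be followed by a blank line.
        yield block


def split_blocks(lines, outputs):
    """Write each block to the output whose prefix matches its first line.

    `outputs` maps a header prefix (e.g. '>3D_helix;') to a writable
    file-like object. Blocks with an unknown header are skipped.
    Returns the number of blocks written per prefix.
    """
    counts = {prefix: 0 for prefix in outputs}
    for block in iter_blocks(lines):
        for prefix, out in outputs.items():
            if block[0].startswith(prefix):
                out.writelines(block)
                out.write('\n')  # keep a blank line between blocks
                counts[prefix] += 1
                break
    return counts


# Possible usage (file names are assumptions taken from the script above):
#
# with open('input.txt') as src, \
#         open('out_3dhelix.txt', 'w') as helix3d, \
#         open('out_1dhelix.txt', 'w') as helix1d, \
#         open('out_3dother.txt', 'w') as other3d:
#     split_blocks(src, {'>3D_helix;': helix3d,
#                        '>1D_helix;': helix1d,
#                        '>3D_other;': other3d})
```

Only one block is held in memory at a time, so this should still be suitable for a huge input file. Unlike the script above, it keeps a blank line between blocks in each output file.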