2016-08-30

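Extracting information from a large file with a specific header format

I am new to Python. I have a large input file in a specific header format, where each header line starts with '>'. My file looks like this: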

>NC_23689 
# 
# XYZ 
# Copyright (c) BLASC 
# 
# Predicted binding regions 
# No.    Start   End  Length 
# 1      1   25   25 
# 2      39   47   9 
# 
>68469409 
# 
# XYZ 
# Copyright (c) BLASC 
# 
# Predicted binding regions 
# None. 
# 
# Prediction profile output: 
# Columns: 
# 1 - Amino acid number 
# 2 - One letter code 
# 3 - probability value 
# 4 - output 
# 
1 M  0.1325  0 
2 S  0.1341  0 
3 S  0.1384  0 
>68464675 
# 
# XYZ 
# Copyright (c) BLASC 
# 
# Predicted binding regions 
# No.    Start   End  Length 
# 1      13   24   12 
# 2      31   53   23 
# 3      81   95   15 
# 4     115   164   50 
# 
... 
... 

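I want to extract each header and its corresponding Start-End value(s) (from the lines after "Predicted binding regions") into a file (output.txt). For the input above (input.txt), the output would be: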

NC_23689: 1-25, 39-47 
68464675: 13-24, 31-53, 81-95, 115-164 

I have tried:

with open('input.txt') as infile, open('output.txt', 'w') as outfile:
    copy = False
    for line in infile:
        if line.strip() == ">+":
            copy = True
        elif line.strip() == "# No.    Start   End  Length":
            copy = True
        elif line.strip() == "#":
            copy = False
        elif copy:
            outfile.write(line)

But it gives me:

# 1      1   25   25 
# 2      39   47   9 
# 1      13   24   12 
# 2      31   53   23 
# 3      81   95   15 
# 4     115   164   50 

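This is clearly not right. I get the ranges, but without the header descriptors and with some extra values. How can I get the output I mentioned above? Thanks.

P.S. I am using Python 2.7 on a Windows 7 machine.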

Answer

Try this:

with open("file.txt") as f:
    first_time = True
    for line in f:
        line = line.rstrip()
        if line.startswith(">"):
            if not first_time:
                if start_ends:
                    print("{}: {}".format(header, ", ".join(start_ends)))
            else:
                first_time = False
            header = line.lstrip(">")
            start_ends = []
        elif len(line.split()) == 5 and "".join(line.split()[1:]).isnumeric():
            start_ends.append("{}-{}".format(line.split()[2], line.split()[3]))
    if start_ends:
        print("{}: {}".format(header, ", ".join(start_ends)))

# Outputs: 
# NC_23689: 1-25, 39-47 
# 68464675: 13-24, 31-53, 81-95, 115-164 

Thanks Chris, but how can I run this script in Python 2.7?


@J.Carter For Python 2.7, you need to add a u on this line to make it a unicode string: 'elif len(line.split()) == 5 and u"".join(line.split()[1:]).isnumeric():'
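
The u prefix matters because in Python 2.7 `isnumeric()` is defined on `unicode` objects but not on plain `str`. A possible alternative that avoids the prefix is `str.isdigit()`, which exists on plain strings in both Python 2.7 and Python 3. The sketch below is only an illustration: the helper name `is_region_row` is hypothetical, and it additionally checks that the first field is '#', which the answer's code does not do.

```python
def is_region_row(line):
    """Return True for rows such as '# 1      13   24   12'."""
    fields = line.split()
    # Expect '#' followed by exactly four all-digit fields:
    # No., Start, End, Length.
    return (len(fields) == 5
            and fields[0] == "#"
            and all(field.isdigit() for field in fields[1:]))
```

With this helper, the condition in the answer could be written as `elif is_region_row(line):`.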


Thanks, that is exactly what I needed.