Extracting information from a large file with a specific header format

I am new to Python. I have a large input file in a header-based format, where header lines begin with '>'. My file looks like this:

>NC_23689 
# 
# XYZ 
# Copyright (c) BLASC 
# 
# Predicted binding regions 
# No.    Start   End  Length 
# 1      1   25   25 
# 2      39   47   9 
# 
>68469409 
# 
# XYZ 
# Copyright (c) BLASC 
# 
# Predicted binding regions 
# None. 
# 
# Prediction profile output: 
# Columns: 
# 1 - Amino acid number 
# 2 - One letter code 
# 3 - probability value 
# 4 - output 
# 
1 M  0.1325  0 
2 S  0.1341  0 
3 S  0.1384  0 
>68464675 
# 
# XYZ 
# Copyright (c) BLASC 
# 
# Predicted binding regions 
# No.    Start   End  Length 
# 1      13   24   12 
# 2      31   53   23 
# 3      81   95   15 
# 4     115   164   50 
# 
... 
...

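I want to extract each header and its corresponding Start-End value(s) (the rows that follow the "Predicted binding regions" line) into an output file (output.txt). For the input above (input.txt), the output would be: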

NC_23689: 1-25, 39-47 
68464675: 13-24, 31-53, 81-95, 115-164

I have tried:

with open('input.txt') as infile, open('output.txt', 'w') as outfile:
    copy = False
    for line in infile:
        if line.strip() == ">+":
            copy = True
        elif line.strip() == "# No.    Start   End  Length":
            copy = True
        elif line.strip() == "#":
            copy = False
        elif copy:
            outfile.write(line)

But it gives me:

# 1      1   25   25 
# 2      39   47   9 
# 1      13   24   12 
# 2      31   53   23 
# 3      81   95   15 
# 4     115   164   50

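This is clearly not right. I get the ranges, but without the header descriptors and with some extra values. How can I get the output I described above? Thanks.

P.S. I am using Python 2.7 on my Windows 7 machine.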


2016-08-30 J.Carter

Try this:

with open("file.txt") as f:
    first_time = True
    for line in f:
        line = line.rstrip()
        if line.startswith(">"):
            # New header: print the previous record if it had any regions.
            if not first_time:
                if start_ends:
                    print("{}: {}".format(header, ", ".join(start_ends)))
            else:
                first_time = False
            header = line.lstrip(">")
            start_ends = []
        # Region rows have five fields: "#", No., Start, End, Length.
        elif len(line.split()) == 5 and "".join(line.split()[1:]).isnumeric():
            start_ends.append("{}-{}".format(line.split()[2], line.split()[3]))
    # Print the last record, which has no following header to trigger it.
    if start_ends:
        print("{}: {}".format(header, ", ".join(start_ends)))

# Outputs: 
# NC_23689: 1-25, 39-47 
# 68464675: 13-24, 31-53, 81-95, 115-164


2016-08-30 10:32:49

Thanks Chris, but how can I run this script in Python 2.7?

@J.Carter For Python 2.7, you need to add a u on this line so that the string is a unicode string: 'elif len(line.split()) == 5 and u"".join(line.split()[1:]).isnumeric():'

Thanks.. exactly what I needed
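
For anyone who would rather not depend on the unicode prefix: `str.isdigit()` exists on plain strings in both Python 2.7 and Python 3, so the row test can be written once for both. This is only a sketch. The helper name `is_region_row` is made up here, and the extra check that the first field is "#" is an added assumption about the file layout, not part of the answer above.

```python
def is_region_row(line):
    """Return True for rows like '# 1      1   25   25'."""
    fields = line.split()
    # Five fields expected: "#", No., Start, End, Length.
    return (
        len(fields) == 5
        and fields[0] == "#"
        and all(field.isdigit() for field in fields[1:])
    )
```

Inside the loop, `elif is_region_row(line):` would take the place of the `isnumeric()` condition.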
