2013-07-27 131 views
0

Delete specific lines from a text file

I have a text file:

>E8|E2|E9D 
Football is a good game 
Its good for health 
you can play it every day 
>E8|E2|E10D 
Sequence unavailable 
>E8|E2|EKB 
Cricket 
>E87|E77|E10D 
Sequence unavailable 
>E27|E97|E10D 
Sequence unavailable 
>E8|E2|E9D 
Sequence unavailable 

I wrote the following code to detect Sequence unavailable in this file and remove it:

import os

with open('input.txt') as f1, open('output.txt', 'w') as f2, \
        open('temp_file', 'w') as f3:
    lines = []  # store lines between two `>` in this list
    for line in f1:
        if line.startswith('>'):
            if lines:
                # flush the previous record to the temp file
                f3.writelines(lines)
                lines = [line]
            else:
                lines.append(line)
        elif line.rstrip('\n') == 'Sequence unavailable':
            # unavailable records go to output.txt instead
            f2.writelines(lines + [line])
            lines = []
        else:
            lines.append(line)

    f3.writelines(lines)

# replace the original file with the filtered copy
os.remove('input.txt')
os.rename('temp_file', 'input.txt')

But what I actually want is to also remove every available sequence for a given issue (the last column of the > line).

For example, even though E9D has the lines below, if another E9D entry has Sequence unavailable, no E9D entry should be written to the output file:

input.txt

>E8|E2|E9D 
Football is a good game 
Its good for health 
you can play it every day 
>E8|E2|E10D 
Sequence unavailable 
>E8|E2|EKB 
Cricket 
>E87|E77|E10D 
Sequence unavailable 
>E27|E97|E10D 
Sequence unavailable 
>E8|E2|E9D 
Sequence unavailable 

output.txt

>E8|E2|EKB 
Cricket 

Here only the EKB issue has entries.
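
To make the rule concrete, here is a minimal Python sketch that works out which last-column values are dropped for a shortened version of the sample above. The helper name last_field and the in-memory sample list are illustrative assumptions and are not part of the original script.

```python
def last_field(header):
    """Return the text after the last '|' on a '>' line (hypothetical helper)."""
    return header.rstrip().rsplit('|', 1)[-1]

# Shortened copy of the sample data, held in memory for illustration only.
sample = [
    '>E8|E2|E9D', 'Football is a good game',
    '>E8|E2|E10D', 'Sequence unavailable',
    '>E8|E2|EKB', 'Cricket',
    '>E8|E2|E9D', 'Sequence unavailable',
]

flagged = set()  # values with at least one "Sequence unavailable" entry
seen = set()     # every last-column value that appears in the sample
current = None
for line in sample:
    if line.startswith('>'):
        current = last_field(line)
        seen.add(current)
    elif line.strip() == 'Sequence unavailable':
        flagged.add(current)

print(sorted(flagged))         # values whose entries must all be dropped
print(sorted(seen - flagged))  # values that may be written to the output
```

With this sample, E9D and E10D end up flagged, so only EKB is left to write.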

Thanks @Martijn Pieters for making it easy to understand – Rocket

Answers

1
def get_name(line):
    # last '|'-separated field of a '>' header line
    return line[1:].rsplit('|', 1)[-1].strip()

with open('input.txt') as f, open('output.txt', 'w') as fout:
    name = ''

    # Phase 1: Find unavailable sequences
    unavailable = set()
    for line in f:
        if line.startswith('>'):
            name = get_name(line)
        else:
            if 'Sequence unavailable' in line:
                unavailable.add(name)

    # Phase 2: Filter available sequences
    f.seek(0)
    keep = False
    for line in f:
        if line.startswith('>'):
            name = get_name(line)
            keep = name not in unavailable
        if keep:
            fout.write(line)
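
To try the two-phase idea without touching files on disk, a sketch along these lines should work. It wraps the same logic in a function and feeds it io.StringIO objects in place of open files. The function name filter_available and the shortened sample text are assumptions made for illustration, and Python 3 is assumed.

```python
import io

def get_name(line):
    # last '|'-separated field of a '>' header line
    return line[1:].rsplit('|', 1)[-1].strip()

def filter_available(src, dst):
    """Copy records from src to dst, skipping names that are ever unavailable."""
    # Phase 1: collect every name that has a "Sequence unavailable" entry
    unavailable = set()
    name = ''
    for line in src:
        if line.startswith('>'):
            name = get_name(line)
        elif 'Sequence unavailable' in line:
            unavailable.add(name)

    # Phase 2: rewind and copy only records whose name was never flagged
    src.seek(0)
    keep = False
    for line in src:
        if line.startswith('>'):
            keep = get_name(line) not in unavailable
        if keep:
            dst.write(line)

src = io.StringIO(
    ">E8|E2|E9D\nFootball is a good game\n"
    ">E8|E2|E10D\nSequence unavailable\n"
    ">E8|E2|EKB\nCricket\n"
    ">E8|E2|E9D\nSequence unavailable\n"
)
dst = io.StringIO()
filter_available(src, dst)
print(dst.getvalue(), end='')
```

Because the function only needs iteration, seek and write, the same call also accepts real file objects opened as in the code above.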

It didn't work on my actual data, which is large; I tried to paste it here – Rocket

I have edited it into my question – Rocket

@Angel, I updated the code. – falsetru

0

You can follow another, simpler approach. Instead of deleting the line, you can replace it with "".

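import fileinput
import sys

# With inplace=True, anything written to stdout goes back into input.txt.
for line in fileinput.input('input.txt', inplace=True):
    if "Sequence unavailable" in line:
        line = line.replace("Sequence unavailable", "")
        line = line.replace("\n", "")  # drop the newline of the emptied line
    sys.stdout.write(line)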

I haven't run this code, but I think the logic is correct. Note that the second replace is needed because every line ends with a newline.
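
As a rough check of the replace-with-empty-string idea, the sketch below applies it to a whole file through fileinput's in-place mode and then reads the result back. The temporary directory and the two-record sample are assumptions made so the example does not touch a real input.txt. Python 3 is assumed.

```python
import fileinput
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'input.txt')
    with open(path, 'w') as f:
        f.write(">E8|E2|E10D\nSequence unavailable\n>E8|E2|EKB\nCricket\n")

    # inplace=True redirects stdout into the file being read
    for line in fileinput.input(path, inplace=True):
        if "Sequence unavailable" in line:
            line = line.replace("Sequence unavailable", "")
            line = line.replace("\n", "")  # remove the emptied line entirely
        sys.stdout.write(line)

    with open(path) as f:
        result = f.read()

print(result, end='')
```

In this run the Sequence unavailable line disappears, while the >E8|E2|E10D header line stays in the file.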