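Extracting lines between repeated headers in a file

I want to process a text file of roughly 43k lines. Wherever the command *Nset appears in the file, I need to extract and save all the lines that follow it, stopping at the next * command in the file. Each command is followed by a varying number of lines and characters. For example, here is a sample section of the file: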

*Nset 

1, 2, 3, 4, 5, 6, 7, 

12, 13, 14, 15, 16, 

17, 52, 75, 86, 92, 

90, 91, 92 93, 94, 95.... 

*NEXT COMMAND 

blah blah blah 

*Nset 

numbers 

*Nset 

numbers 

*Command 

irrelevant text

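My current code works when the numbers I need do not sit between two *Nset commands. When one *Nset directly follows the numbers of another *Nset, the program skips that command and its lines, and I don't know why. When the next command is not *Nset, it finds the following command and extracts the data perfectly.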

import re 

# read in the input deck 
deck_name = 'master.txt' 
deck = open(deck_name,'r') 

#initialize variables 
nset_data = [] 
matched_nset_lines = [] 
nset_count = 0 

for line in deck:
    # loop to extract all nset names and node numbers
    important_line = re.search(r'\*Nset,.*', line)
    if important_line:
        line_value = important_line.group()  # name for nset
        matched_nset_lines.insert(nset_count, line_value)  # name for nset
        temp = []

        # read lines from the found match up until the next *command
        for line_x in deck:
            if not re.match(r'\*', line_x):
                temp.append(line_x)
            else:
                break

        nset_data.append(temp)

    nset_count = nset_count + 1

I am using Python 3.5. Thanks for your help.

Source

2017-07-05 K. Gibboney

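Does a command *always* sit at the start of a line, beginning with '*'?

@juanpa.arrivillaga, yes. There are various commands, but every command is preceded by '*'. The next line is then the numbers.

Might this be relevant at all? https://stackoverflow.com/questions/25943000/finding-a-word-between-two-words-that-will-not-match-if-the-closing-word-occurs

If you only want to extract the lines that follow the *Nset commands, the following approach should work: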

In [5]: with open("master.txt") as f:
   ...:     data = []
   ...:     gather = False
   ...:     for line in f:
   ...:         line = line.strip()
   ...:         if line.startswith("*Nset"):
   ...:             gather = True
   ...:         elif line.startswith("*"):
   ...:             gather = False
   ...:         elif line and gather:
   ...:             data.append(line)
   ...:

In [6]: data 
Out[6]: 
['1, 2, 3, 4, 5, 6, 7,', 
'12, 13, 14, 15, 16,', 
'17, 52, 75, 86, 92,', 
'90, 91, 92 93, 94, 95....', 
'numbers', 
'numbers']

And if you want more information, it is easy to extend the above:

In [7]: with open("master.txt") as f:
   ...:     nset_lines = []
   ...:     nset_count = 0
   ...:     data = []
   ...:     gather = False
   ...:     for i, line in enumerate(f):
   ...:         line = line.strip()
   ...:         if line.startswith("*Nset"):
   ...:             gather = True
   ...:             nset_lines.append(i)
   ...:             nset_count += 1
   ...:         elif line.startswith("*"):
   ...:             gather = False
   ...:         elif line and gather:
   ...:             data.append(line)
   ...:

In [8]: nset_lines 
Out[8]: [0, 14, 18] 

In [9]: nset_count 
Out[9]: 3 

In [10]: data 
Out[10]: 
['1, 2, 3, 4, 5, 6, 7,', 
'12, 13, 14, 15, 16,', 
'17, 52, 75, 86, 92,', 
'90, 91, 92 93, 94, 95....', 
'numbers', 
'numbers']
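
A note on why the nested loop in the question drops a *Nset block that directly follows another one: a file object is its own iterator, so the inner `for line_x in deck` and the outer `for line in deck` pull from the same stream. The inner loop has to read the next `*` line in order to notice it and `break`, and by then that line has been consumed, so the outer loop never gets to test it. The sketch below reproduces this on a small in-memory sample; `io.StringIO` stands in for the real file, and the sample text and the names `sample` and `found` are made up for illustration.

```python
import io

# Hypothetical miniature input: two *Nset blocks back to back, then another command.
sample = io.StringIO("*Nset, a\n1, 2\n*Nset, b\n3, 4\n*Other\nx\n")

found = []
for line in sample:
    if line.startswith("*Nset"):
        block = []
        # The inner loop advances the SAME iterator as the outer loop.
        for inner in sample:
            if inner.startswith("*"):
                # This header line is now consumed; the outer loop will
                # resume with the line after it and never see the header.
                break
            block.append(inner.strip())
        found.append((line.strip(), block))

# Only the first block is captured; "*Nset, b" was swallowed by the break.
print(found)
```

The single-loop version with a `gather` flag avoids this because every line is read exactly once and examined in one place.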

Source

2017-07-05 19:37:21

This does what you want.

import pprint

command = []
commandLines = []

with open('test.txt') as file:
    for line in file:
        if line.startswith('*'):
            command.append(line.rstrip())
            commandLines.append([])
        else:
            commandLines[-1].append(line.rstrip())

pprint.pprint(command)
pprint.pprint(commandLines)

commandLines[i] is a list containing the lines that belong to command[i].

The printed commands:

['*Nset', '*NEXT COMMAND', '*Nset', '*Nset', '*Command']

And the lines (a nested list):

[['1, 2, 3, 4, 5, 6, 7,', 
    '12, 13, 14, 15, 16,', 
    '17, 52, 75, 86, 92,', 
    '90, 91, 92 93, 94, 95....'], 
['blah blah blah'], 
['numbers'], 
['numbers'], 
['irrelevant text']]

Assumption: only command lines start with '*'.

Source
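
If only the *Nset blocks are wanted, and each block should stay separate, the same single-pass idea can be sketched as below. This is an illustration rather than part of the answer above: the function name `collect_nset_blocks` and the sample text are hypothetical, blank lines are skipped, and data lines that appear before the first command are ignored instead of raising an `IndexError` from `commandLines[-1]`.

```python
def collect_nset_blocks(lines):
    """Return a list of (header, data_lines) pairs, one per *Nset command."""
    blocks = []
    current = None  # data list of the *Nset block being read, or None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("*"):
            if line.startswith("*Nset"):
                # Start a new block; later data lines go into `current`.
                current = []
                blocks.append((line, current))
            else:
                # Any other command ends the current *Nset block.
                current = None
        elif current is not None:
            current.append(line)
    return blocks


# Hypothetical sample shaped like the excerpt in the question.
sample = """\
stray line before any command
*Nset
1, 2, 3

4, 5
*NEXT COMMAND
blah blah blah
*Nset
6
*Nset
7
*Command
irrelevant text
"""

blocks = collect_nset_blocks(sample.splitlines())
for header, data_lines in blocks:
    print(header, data_lines)
```

Because a file object iterates line by line, `collect_nset_blocks(open_file)` should work the same way on the real input as it does on `sample.splitlines()`.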

2017-07-05 19:35:06
