2015-11-07 21 views
0

I have a large number of text files (> 1000), all in the same format. How can I retrieve only the second instance of a string in each text file?

The part of the file I am interested in looks like this:

# event 9 
num:  1 
length:  0.000000 
otherstuff: 19.9 18.8 17.7 
length: 0.000000 176.123456 

# event 10 
num:  1 
length:  0.000000 
otherstuff: 1.1 2.2 3.3 
length: 0.000000 1201.123456 

I only need the second value from the second instance of a given variable, in this case length. Is there a pythonic way to do this (i.e. not sed)?

My code looks like this:

with open(wave_cat,'r') as catID:
    for i, cat_line in enumerate(catID):
        if not len(cat_line.strip()) == 0:
            line = cat_line.split()
            #replen = re.sub('length:','length0:','length:')
            if line[0] == '#' and line[1] == 'event':
                num = long(line[2])
            elif line[0] == 'length:':
                Length = float(line[2])
+0

Is this the entire content of one file? – beezz

+0

No, each file has more than 10 events, but they are all in the same format. Edit: I have changed the file format shown above. – scootie

Answers

0

Use a counter:

with open(wave_cat,'r') as catID:
    ct = 0  # number of 'length:' lines seen so far in the current event
    for i, cat_line in enumerate(catID):
        if not len(cat_line.strip()) == 0:
            line = cat_line.split()
            #replen = re.sub('length:','length0:','length:')
            if line[0] == '#' and line[1] == 'event':
                num = int(line[2])  # int works in Python 2 and 3; long is Python 2 only
            elif line[0] == 'length:':
                ct += 1
                if ct == 2:
                    # second 'length:' line: take its second value
                    Length = float(line[2])
                    ct = 0
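
A variant of the same counter idea, as a minimal sketch: resetting the count at every `# event` header, rather than after the second hit, should keep it correct even if some event has fewer or more than two `length:` lines. The names `second_lengths` and `SAMPLE` are made up for illustration; the data is the sample from the question.

```python
def second_lengths(lines):
    """Yield (event_number, value) for the second 'length:' line of each event."""
    event = None
    seen = 0
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        if fields[:2] == ['#', 'event']:
            event = int(fields[2])
            seen = 0  # new event: restart the count
        elif fields[0] == 'length:':
            seen += 1
            if seen == 2:
                # fields is ['length:', first_value, second_value]
                yield event, float(fields[2])


SAMPLE = """\
# event 9
num:  1
length:  0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456

# event 10
num:  1
length:  0.000000
otherstuff: 1.1 2.2 3.3
length: 0.000000 1201.123456
"""

result = list(second_lengths(SAMPLE.splitlines()))
```

Because it accepts any iterable of lines, the same function can be fed an open file object directly.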
+0

That works, thanks! – scootie

0

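You're on the right track. It will probably be faster to defer the split until you actually need it. Also, if you are scanning a lot of files and only want the second length entry, breaking out of the loop as soon as you have seen it will save a lot of time.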

length_seen = 0
elements = []  # one dict per event: {'num': ..., 'length': ...}
with open(wave_cat,'r') as catID:
    for line in catID:
        line = line.strip()
        if not line:
            continue
        if line.startswith('# event'):
            element = {'num': int(line.split()[2])}
            elements.append(element)
            length_seen = 0  # restart the count for each event
        elif line.startswith('length:'):
            length_seen += 1
            if length_seen == 2:
                # split only the line that is actually needed
                element['length'] = float(line.split()[2])
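
The early exit mentioned in this answer could look roughly like the sketch below, for the case where only the first event's value is wanted. The function name `first_second_length` is hypothetical, and an in-memory `io.StringIO` stands in for a real file so the leftover input can be inspected.

```python
import io


def first_second_length(lines):
    """Return the second value on the second 'length:' line of the first
    event that has one, or None if no event qualifies."""
    seen = 0
    for line in lines:
        line = line.strip()
        if line.startswith('# event'):
            seen = 0
        elif line.startswith('length:'):
            seen += 1
            if seen == 2:
                return float(line.split()[2])  # split only the line we need
    return None


stream = io.StringIO(
    "# event 9\n"
    "num:  1\n"
    "length:  0.000000\n"
    "otherstuff: 19.9 18.8 17.7\n"
    "length: 0.000000 176.123456\n"
    "\n"
    "# event 10\n"
    "num:  1\n"
    "length:  0.000000\n"
    "length: 0.000000 1201.123456\n"
)
value = first_second_length(stream)
unread = stream.read()  # whatever the early return left unconsumed
```

Returning from inside the loop plays the role of `break`: everything from event 10 onward is never examined.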
+0

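That did speed things up, thanks for pointing it out! I also added length_seen = 0 before the break, because a single file contains multiple copies of the same text. – scootie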

+0

I have modified it to build a list of elements for the file, including the number and the length. – chthonicdaemon

1

If you can read the whole file into memory, just run a regex against the file contents:

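import glob, re

# Captures the second value on the second 'length:' line of each event
pat = re.compile(r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
for fn in glob.glob('*.txt'):  # assumed pattern; adjust to match your files
    with open(fn) as f:
        try:
            nm = pat.findall(f.read())[1]  # index 1 = second match in the file
        except IndexError:
            nm = ''
        print(nm)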
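
To see what `findall` returns with the pattern used in this answer, here is a small check against the sample from the question, held in a string instead of a file. The names `PAT`, `sample` and `values` are only for illustration.

```python
import re

# re.M lets ^ match at each line start; re.S lets .*? run across newlines.
PAT = re.compile(
    r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)',
    re.S | re.M,
)

sample = """\
# event 9
num:  1
length:  0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456

# event 10
num:  1
length:  0.000000
otherstuff: 1.1 2.2 3.3
length: 0.000000 1201.123456
"""

# One captured string per event; convert to float for numeric use.
values = [float(v) for v in PAT.findall(sample)]
```

Each match covers one whole event, so indexing the result with `[1]` picks the second event's value, not the second `length:` line.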

If the files are large, use mmap:

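import glob, mmap, re

nth = 1  # zero-based index of the match to print
# Bytes pattern, because an mmap object exposes bytes in Python 3
pat = re.compile(br'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
for fn in glob.glob('*.txt'):  # assumed pattern; adjust to match your files
    with open(fn, 'rb') as f:
        # Read-only mapping of the whole file (length 0 = entire file)
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for i, m in enumerate(pat.finditer(mm)):
            if i == nth:
                print(m.group(1).decode())
                break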