How do I retrieve only the second instance of a string in a text file?

I have a large number of text files (> 1000), all in the same format. How do I retrieve only the second instance of a string in each text file?

The part of the file I'm interested in looks like this:

# event 9 
num:  1 
length:  0.000000 
otherstuff: 19.9 18.8 17.7 
length: 0.000000 176.123456 

# event 10 
num:  1 
length:  0.000000 
otherstuff: 1.1 2.2 3.3 
length: 0.000000 1201.123456

I only need the second value from the second instance of a given variable, in this case length. Is there a Pythonic way to do this (i.e. not sed)?

My code looks like this:

with open(wave_cat, 'r') as catID:
    for i, cat_line in enumerate(catID):
        if not len(cat_line.strip()) == 0:
            line = cat_line.split()
            #replen = re.sub('length:','length0:','length:')
            if line[0] == '#' and line[1] == 'event':
                num = int(line[2])  # was long() in Python 2
            elif line[0] == 'length:':
                Length = float(line[2])


2015-11-07 scootie

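Is this the entire content of one file? – beezz

No, each file has more than 10 events, but they are all in the same format. Edit: I've changed the file format above. – scootie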

Use a counter:

with open(wave_cat, 'r') as catID:
    ct = 0  # number of 'length:' lines seen in the current event
    for i, cat_line in enumerate(catID):
        if not len(cat_line.strip()) == 0:
            line = cat_line.split()
            #replen = re.sub('length:','length0:','length:')
            if line[0] == '#' and line[1] == 'event':
                num = int(line[2])  # was long() in Python 2
            elif line[0] == 'length:':
                ct += 1
                if ct == 2:
                    # second 'length:' line: take its second value
                    Length = float(line[2])
                    ct = 0  # reset for the next event
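
To check the counter approach against the sample from the question, here is a minimal sketch that wraps it in a function. The name `second_lengths`, the `SAMPLE` string, and the use of `io.StringIO` as a stand-in for a real file are assumptions for illustration. Unlike the snippet above, it resets the counter at each `# event` header rather than after the second match, so it does not rely on every event having exactly two `length:` lines.

```python
import io

# Sample data copied from the question (trailing spaces omitted).
SAMPLE = """\
# event 9
num:  1
length:  0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456

# event 10
num:  1
length:  0.000000
otherstuff: 1.1 2.2 3.3
length: 0.000000 1201.123456
"""


def second_lengths(lines):
    """Map each event number to the second value on its second 'length:' line."""
    result = {}
    num = None   # current event number, None until a header is seen
    count = 0    # 'length:' lines seen in the current event
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == '#' and len(fields) >= 3 and fields[1] == 'event':
            num = int(fields[2])
            count = 0
        elif fields[0] == 'length:':
            count += 1
            if count == 2 and num is not None:
                result[num] = float(fields[2])
    return result


print(second_lengths(io.StringIO(SAMPLE)))
```

With a real file, the open file object can be passed in place of the `io.StringIO` wrapper, since both yield lines when iterated.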


2015-11-07 16:37:33

That works, thanks! – scootie

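You're on the right track. It will probably be faster to defer the split until you actually need it. Also, if you are scanning a lot of files and only want the second length entry, breaking out of the loop as soon as you have seen it will save a lot of time.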

length_seen = 0
elements = []
with open(wave_cat, 'r') as catID:
    for line in catID:
        line = line.strip()
        if not line:
            continue
        if line.startswith('# event'):
            element = {'num': int(line.split()[2])}
            elements.append(element)
            length_seen = 0
        elif line.startswith('length:'):
            length_seen += 1
            if length_seen == 2:
                element['length'] = float(line.split()[2])
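
The early-exit idea mentioned above is not shown in the list-building version, so here is a minimal sketch of it, assuming only the first event's second `length:` entry is wanted from each file. The function name `second_length` is hypothetical; it accepts any iterable of lines so that it works with an open file as well as a list.

```python
def second_length(lines):
    """Return the second value on the second 'length:' line, or None if absent."""
    seen = 0
    for line in lines:
        line = line.strip()
        # Only split the line once we know it is the one we need.
        if line.startswith('length:'):
            seen += 1
            if seen == 2:
                # Returning here stops reading the rest of the input.
                return float(line.split()[2])
    return None
```

For a file, this could be called as `with open(wave_cat) as f: value = second_length(f)`. Because the function returns as soon as the second entry is found, the remaining lines of the file are never read.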


2015-11-07 16:40:06 chthonicdaemon

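That did speed things up, thanks for pointing it out! I also added length_seen = 0 before the break, because there are multiple copies of the same text within a single file. – scootie

I've modified it to build a list of the file's elements, including the number and the length. – chthonicdaemon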

If you can read the whole file into memory, just run a regex against the file contents:

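import glob, re

# Captures the second value on the second 'length:' line of each event
pat = re.compile(r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
filenames = glob.glob('*.txt')  # assumed pattern; substitute your own list of files
for fn in filenames:
    with open(fn) as f:
        try:
            nm = pat.findall(f.read())[1]  # index 1 is the second match in the file
        except IndexError:
            nm = ''  # fewer than two matches
        print(nm)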

If the files are large, use mmap:

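import glob, mmap, re

nth = 1  # zero-based index of the match to report (1 = second match)
# Bytes pattern, because mmap exposes the file contents as bytes in Python 3
pat = re.compile(rb'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
filenames = glob.glob('*.txt')  # assumed pattern; substitute your own list of files
for fn in filenames:
    with open(fn, 'rb') as f:
        # Map the file read-only instead of loading it into memory
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for i, m in enumerate(pat.finditer(mm)):
            if i == nth:
                print(m.group(1).decode())
                break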
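
If the value is needed for every event rather than for one particular match, the same regex idea can be extended to capture the event number as well. The sketch below is an assumption-laden variant, not the pattern from the answer: the function name `second_lengths_regex` is hypothetical, and the pattern adds a capture group for the event number and uses `\s+` instead of a single `\s`. It also assumes every event has at least two `length:` lines; if one does not, the lazy match will run on into the next event.

```python
import re

# Group 1: event number. Group 2: second value on the event's second 'length:' line.
EVENT_PAT = re.compile(
    r'^# event\s+(\d+).*?^length:.*?^length:\s+[\d.]+\s+(\d+\.\d+)',
    re.S | re.M,
)


def second_lengths_regex(text):
    """Map each event number in text to the second value of its second 'length:' line."""
    return {int(m.group(1)): float(m.group(2)) for m in EVENT_PAT.finditer(text)}
```

For a file that fits in memory, this could be called as `second_lengths_regex(open(fn).read())`.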


2015-11-07 16:47:54 dawg
