Repeatedly extract lines between two delimiters in a text file, Python

I have a text file in the following format:

DELIMITER1 
extract me 
extract me 
extract me 
DELIMITER2

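I want to extract every block of "extract me" lines between DELIMITER1 and DELIMITER2, for every occurrence of such a block in the .txt file.

This is my current, non-working code: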

import re
def GetTheSentences(file):
    fileContents = open(file)
    start_rx = re.compile('DELIMITER')
    end_rx = re.compile('DELIMITER2')

    line_iterator = iter(fileContents)
    start = False
    for line in line_iterator:
        if re.findall(start_rx, line):
            start = True
            break
    while start:
        next_line = next(line_iterator)
        if re.findall(end_rx, next_line):
            break
        print(next_line)
        continue
    next(line_iterator)

Any ideas?

Source

2011-08-17 Renklauf

You can simplify this to a single regular expression using re.S, the DOTALL flag.

import re
def GetTheSentences(infile):
    with open(infile) as fp:
        # re.S lets '.' match newlines, so one match can span several lines
        for result in re.findall('DELIMITER1(.*?)DELIMITER2', fp.read(), re.S):
            print(result)
# extract me
# extract me
# extract me

This also makes use of the non-greedy operator .*?, so multiple non-overlapping blocks of DELIMITER1-DELIMITER2 pairs will all be found.

Source
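To illustrate why the non-greedy form matters, here is a minimal sketch on a made-up in-memory string (the sample text and variable names are assumptions for illustration, not part of the question's data). It compares `.*?` with the greedy `.*` under the same re.S flag.

```python
import re

# Hypothetical sample: two delimited blocks with unrelated text in between.
text = (
    "DELIMITER1\nfirst block\nDELIMITER2\n"
    "noise between blocks\n"
    "DELIMITER1\nsecond block\nDELIMITER2\n"
)

# Non-greedy: each match stops at the nearest DELIMITER2.
non_greedy = re.findall(r"DELIMITER1(.*?)DELIMITER2", text, re.S)

# Greedy: a single match runs from the first DELIMITER1 to the last DELIMITER2.
greedy = re.findall(r"DELIMITER1(.*)DELIMITER2", text, re.S)

# Strip the newlines that surround each captured block.
blocks = [block.strip() for block in non_greedy]
print(blocks)
print(len(greedy))
```

With the greedy pattern the text between the two blocks ends up inside the one match, which is usually not what you want here.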

2011-08-17 19:59:42

Hint: if your file is too big to read all at once, use this with a memory-mapped file object (via the 'mmap' module). – Steven
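A rough sketch of what that hint could look like, assuming Python 3 and a UTF-8 encoded file. The function name `get_blocks_mmap` is made up for this example. The pattern has to be a bytes pattern because a memory map exposes raw bytes rather than text.

```python
import mmap
import re

def get_blocks_mmap(path):
    """Return the text between each DELIMITER1/DELIMITER2 pair in a file."""
    # Bytes pattern: the memory map is a bytes-like object, not a str.
    pattern = re.compile(rb"DELIMITER1(.*?)DELIMITER2", re.S)
    with open(path, "rb") as fp:
        # Length 0 maps the whole file; an empty file cannot be mapped.
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Assumption: the file is UTF-8; adjust the codec if it is not.
            return [raw.decode("utf-8") for raw in pattern.findall(mm)]
```

The operating system pages the file in as the regex scans it, so the whole file does not have to be read into a Python string first.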

@Brent Tried this and it works nicely... thanks! – Renklauf

Glad I could help. If this answered your question best, don't forget to mark it as accepted. –

This should do what you want:

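import re
def GetTheSentences(file):
    # 'DELIMITER' alone would also match 'DELIMITER2', so spell out both names
    start_rx = re.compile('DELIMITER1')
    end_rx = re.compile('DELIMITER2')

    start = False
    output = []
    # Text mode, so the str patterns above can be matched against each line
    with open(file) as datafile:
        for line in datafile.readlines():
            if re.match(start_rx, line):
                start = True    # opening delimiter: start collecting
            elif re.match(end_rx, line):
                start = False   # closing delimiter: stop collecting
            elif start:
                output.append(line)  # only lines between the delimiters
    return output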

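Your previous version looks like it was meant to be an iterator function. Do you want your output returned one item at a time? That's slightly different.

Source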
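If one-item-at-a-time output is what's wanted, a generator is one way to sketch it. This assumes each delimiter sits alone on its own line; the name `iter_blocks`, its parameters, and the sample lines are invented for illustration.

```python
def iter_blocks(lines, start="DELIMITER1", end="DELIMITER2"):
    """Lazily yield one list of lines per start/end block."""
    block = None  # None means we are currently outside a block
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            block = []            # opening delimiter: begin a new block
        elif stripped == end:
            if block is not None:
                yield block       # closing delimiter: hand the block over
            block = None
        elif block is not None:
            block.append(line.rstrip("\n"))

# Hypothetical sample; an open file object can be passed the same way.
sample = [
    "DELIMITER1\n", "extract me\n", "extract me\n", "DELIMITER2\n",
    "ignore me\n",
    "DELIMITER1\n", "extract me too\n", "DELIMITER2\n",
]
for block in iter_blocks(sample):
    print(block)
```

Because it only ever holds the current block, this also works on files too large to read in one go, e.g. `with open(path) as fp: for block in iter_blocks(fp): ...`.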

2011-08-17 19:54:13

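There's no need to read the whole file into memory. You also don't need regular expressions if you're just looking for a specific substring in a line. – agf

@agf Of course not, but his simple example may not exactly match his data. I did something very similar with PostScript files, and I absolutely needed regular expressions for my start and end points. –

@everyone Thanks for all the help! – Renklauf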

If the delimiters are within the line:

def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ','  # just example delimiters
        for line in file_contents:
            i1, i2 = line.find(d1), line.find(d2)
            if -1 < i1 < i2:
                yield line[i1+1:i2]

sentences = list(get_sentences('path/to/my/file'))

If they're on their own lines:

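def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ','  # just example delimiters
        results = None  # None means we are outside a block
        for line in file_contents:
            if d1 in line:
                results = []  # opening delimiter: start a new block
            elif d2 in line and results is not None:
                yield results  # closing delimiter: emit the finished block
                results = None
            elif results is not None:
                results.append(line)

sentences = list(get_sentences('path/to/my/file'))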

Source

2011-08-17 19:55:09 agf

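Traceback (most recent call last): File "", line 1, in File "", line 10, in get_sentences UnboundLocalError: local variable 'results' referenced before assignment – amadain

@amadain I added a line to initialize results, but looking at this I'm not sure it's correct anyway. – agf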

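This is a good job for list comprehensions; no regex is required. The first list comp strips the typical \n from each line of text read from the txt file. The second list comp just uses the in operator to identify the delimiter strings to filter out.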

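def extract_lines(file):
    # Read every line, dropping the trailing newline; 'with' closes the file
    with open(file, 'r') as f:
        scrubbed = [x.strip('\n') for x in f]
    # Keep everything except the delimiter lines themselves
    return [x for x in scrubbed if x not in ('DELIMITER1', 'DELIMITER2')]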
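To see what the two comprehensions do step by step, here is a small sketch on a hypothetical in-memory list of lines (the sample data is an assumption, not taken from the question). Note that this approach only removes the delimiter lines themselves, so any text outside a DELIMITER1/DELIMITER2 pair is kept as well; it seems best suited to files that contain nothing but delimited blocks.

```python
# Hypothetical sample with one line before and one line after the block.
lines = ["before\n", "DELIMITER1\n", "extract me\n", "DELIMITER2\n", "after\n"]

# First comprehension: drop the trailing newline from every line.
scrubbed = [x.strip("\n") for x in lines]

# Second comprehension: drop only the delimiter lines themselves.
kept = [x for x in scrubbed if x not in ("DELIMITER1", "DELIMITER2")]
print(kept)
```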

Source

2015-05-10 05:00:01 cheekybastard
