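As recommended by Ignacio Vazquez-Abrams, use a deque to store the last n lines. Once that many lines are present, popleft for each new line added. When your regular expression finds a match, return the previous n lines in the stack, then iterate over n more lines and return those as well.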
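To make the rolling buffer concrete, here is a minimal sketch that is separate from the full example. It assumes you only need the last n items: passing `maxlen` to `deque` discards the oldest entry automatically, which has the same effect as calling `popleft()` by hand once the buffer is full. The helper name `last_n_lines` and the sample data are made up for illustration.

```python
from collections import deque


def last_n_lines(lines, n):
    """Return the last n items seen, using a bounded deque as the buffer."""
    # With maxlen set, appending to a full deque drops the leftmost (oldest) item.
    buffer = deque(maxlen=n)
    for line in lines:
        buffer.append(line)
    return list(buffer)


sample = ["line 1", "line 2", "line 3", "line 4", "line 5"]
print(last_n_lines(sample, 2))
```

The full example manages the deque by hand instead, because after a hit it has to keep growing the buffer with trailing context rather than discarding old lines.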
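This keeps you from having to iterate over any line twice (DRY) and stores only minimal data in memory. You also mentioned the need for Unicode, so handling the file encoding and adding the Unicode flag to the regex searches is important. Also, the other answer uses re.match() instead of re.search(), which only matches at the beginning of each line and so may have unintended consequences.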
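The difference between the two calls is easy to miss, so here is a small sketch with made-up sample strings. It only uses `re.match`, `re.search`, and the `re.IGNORECASE` and `re.UNICODE` flags from the standard library.

```python
import re

line = u"an error occurred"

# re.match only succeeds if the pattern matches at the start of the string,
# so a grep built on it would miss hits in the middle of a line.
match_result = re.match(u"error", line)

# re.search scans the whole string and returns the first match anywhere.
search_result = re.search(u"error", line)

# Combining IGNORECASE with UNICODE lets case-insensitive matching
# work for non-ASCII letters as well (here: U+00C9 vs. U+00E9).
folded_result = re.search(u"\u00c9COLE", u"une \u00e9cole",
                          re.IGNORECASE | re.UNICODE)

print(match_result is None)
print(search_result.group(0))
print(folded_result is not None)
```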
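Below is an example. It iterates over every line in the file only ONCE, which means that context lines that also contain hits are not looked at again. This may or may not be desirable behavior, but it can easily be tweaked to highlight or otherwise flag lines with additional hits that fall within the context of a previous hit.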
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
import re
from collections import deque


def grep(pattern, input_file, context=0, case_sensitivity=True, file_encoding='utf-8'):
    flags = re.UNICODE if case_sensitivity else re.IGNORECASE | re.UNICODE
    regex = re.compile(pattern, flags)
    stack = deque()
    hits = []
    lines_remaining = None  # None means no trailing context is being collected
    with codecs.open(input_file, mode='rb', encoding=file_encoding) as f:
        for line in f:
            # append next line to stack
            stack.append(line)
            # keep adding context after hit found (without popping off previous lines of context)
            if lines_remaining is not None:
                lines_remaining -= 1
                if lines_remaining == 0:
                    hits.append(stack)
                    lines_remaining = None
                    stack = deque()
                continue  # go to next line in file; trailing context is not searched
            # if stack exceeds needed context, pop leftmost line off stack
            # (but include current line with possible search hit if applicable)
            if len(stack) > context + 1:
                stack.popleft()
            # search line for pattern
            if regex.search(line):
                if context > 0:
                    lines_remaining = context
                else:
                    # no trailing context requested, so record the hit right away
                    hits.append(stack)
                    stack = deque()
    # in case there are not enough lines left in the file to provide trailing context
    if lines_remaining is not None and len(stack) > 0:
        hits.append(stack)
    # return list of deques containing hits with context
    return hits  # you'll probably want to format the output, this is just an example
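As for formatting the output, one possible approach is sketched below. It assumes the result is a list of deques of lines that still carry their trailing newlines, as produced when iterating over a file. The helper name `format_hits` and the sample data are hypothetical; the `--` separator imitates what GNU grep prints between non-adjacent context groups.

```python
from collections import deque


def format_hits(hits, separator=u"--"):
    """Join groups of context lines into one string, with a separator line between groups."""
    groups = [u"".join(group) for group in hits]
    # The last line of a file may lack a newline; add one so the separator
    # always starts on its own line.
    groups = [g if g.endswith(u"\n") else g + u"\n" for g in groups]
    return (separator + u"\n").join(groups)


# Stand-in for the value returned by grep(): a list of deques of lines.
example_hits = [
    deque([u"before\n", u"first match\n", u"after\n"]),
    deque([u"second match\n", u"last line"]),
]
formatted = format_hits(example_hits)
print(formatted)
```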
Could you add some sample text that you want to extract from? – SIslam
https://docs.python.org/2/library/collections.html#collections.deque –
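@SIslam The text is irrelevant. What I want is the functionality of 'grep -C'. I don't have sample text; I could come up with some, but that isn't necessary, because the tool defines the functionality. – MatthewRock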