2015-12-02 58 views
1

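Python equivalent of 'grep -C N'?

So right now I'm looking for something in a file. I get a value variable, which is a fairly long string containing newlines and so on. I then use re.findall(regex, value) to find the matches. The regex is simple, something like "abc de.*".

Now I want to capture not only whatever the regex matches, but also the surrounding context (exactly like grep's -C flag).

So, assuming I dumped value to a file and ran grep on it, what I would do is grep -C N 'abc de .*' valueinfile

How can I achieve the same thing in Python? I need an answer that works with Unicode regexes/text.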

+0

Could you add some sample text that you want to extract from? – SIslam

+0

https://docs.python.org/2/library/collections.html#collections.deque –

+0

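@SIslam The text doesn't matter. What I want is the functionality of 'grep -C'. I don't have sample text; I could come up with some, but it isn't necessary, because the tool defines the functionality. – MatthewRock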

Answers

2

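My approach is to split the text block into a list of lines. Next, iterate over each line and check whether it matches. On a match, collect the context lines (the lines that come before and after the current line) and yield them. Here is my code: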

import re

def grep(pattern, block, context_lines=0):
    lines = block.splitlines()
    for line_number, line in enumerate(lines):
        # re.match only matches at the start of the line
        if re.match(pattern, line):
            # Clamp the start index: a negative value would slice from the end of the list
            start = max(0, line_number - context_lines)
            lines_with_context = lines[start:line_number + context_lines + 1]
            yield '\n'.join(lines_with_context)

# Try it out
text_block = """One
Two
Three
abc defg
four
five
six
abc defoobar
seven
eight
abc de"""

pattern = 'abc de.*'

for line in grep(pattern, text_block, context_lines=2):
    print(line)
    print('---')

Output:

Two 
Three 
abc defg 
four 
five 
--- 
five 
six 
abc defoobar 
seven 
eight 
--- 
seven 
eight 
abc de 
--- 
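
When two matches sit close together, grep -C prints their context as one merged group instead of repeating the shared lines, as happens with "five" in the output above. A minimal sketch of that merging, built on the same split-into-lines idea; the name grep_merged is hypothetical, and it assumes re.search with re.UNICODE so the pattern may match anywhere in a line:

```python
import re

def grep_merged(pattern, block, context_lines=0, flags=re.UNICODE):
    """Yield lists of lines; overlapping or adjacent context windows are merged."""
    lines = block.splitlines()
    regex = re.compile(pattern, flags)
    group_start = group_end = None  # half-open range [start, end) of the open group
    for i, line in enumerate(lines):
        if not regex.search(line):
            continue
        start = max(0, i - context_lines)
        end = min(len(lines), i + context_lines + 1)
        if group_end is not None and start <= group_end:
            # This window touches or overlaps the open group: extend it
            group_end = max(group_end, end)
        else:
            if group_end is not None:
                yield lines[group_start:group_end]
            group_start, group_end = start, end
    if group_end is not None:
        yield lines[group_start:group_end]
```

Each yielded group is a list of lines, so a caller can join a group with '\n' and print a separator between groups. With the sample text above and context_lines=1, this yields two groups.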
+0

Great. I'll try it out and see if it works. – MatthewRock

0

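As recommended by Ignacio Vazquez-Abrams, use a deque to store the last n lines. Once that many lines are present, popleft for each new line added. When your regex finds a match, return the previous n lines, then iterate over n more lines and return those as well.

This keeps you from iterating over any line twice (DRY) and stores only the minimum amount of data in memory. You also mentioned the need for Unicode, so handling the file encoding and adding the Unicode flag to the regex search is important. Also, the other answer uses re.match() instead of re.search(), which may have unintended consequences.

Below is an example. It iterates over each line in the file only ONCE, which means that context lines which also contain a hit are not examined again. That may or may not be the desired behavior, but it can easily be adjusted to highlight or otherwise flag additional hits within the context of a previous hit.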

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import codecs
import re
from collections import deque

def grep(pattern, input_file, context=0, case_sensitivity=True, file_encoding='utf-8'):
    stack = deque()
    hits = []
    lines_remaining = None  # trailing context lines still to collect; None means no open hit

    # compile once, always with the Unicode flag
    flags = re.UNICODE if case_sensitivity else re.IGNORECASE | re.UNICODE
    regex = re.compile(pattern, flags)

    with codecs.open(input_file, mode='rb', encoding=file_encoding) as f:
        for line in f:
            # append next line to stack
            stack.append(line)

            # keep adding context after hit found (without popping off previous lines of context)
            if lines_remaining is not None:
                lines_remaining -= 1
                if lines_remaining == 0:
                    hits.append(stack)
                    lines_remaining = None
                    stack = deque()
                continue  # go to next line in file

            # if stack exceeds needed context, pop leftmost line off stack
            # (but include current line with possible search hit if applicable)
            if len(stack) > context + 1:
                stack.popleft()

            # search line for pattern
            if regex.search(line):
                if context == 0:
                    # no trailing context wanted: record the hit immediately
                    hits.append(stack)
                    stack = deque()
                else:
                    lines_remaining = context

    # in case there are not enough lines left in the file to provide trailing context
    if lines_remaining is not None and len(stack) > 0:
        hits.append(stack)

    # return list of deques containing hits with context
    return hits  # you'll probably want to format the output, this is just an example
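
As a variation on the same idea, deque accepts a maxlen argument that discards the oldest line automatically, so no explicit popleft is needed. The following is a rough sketch under a few assumptions rather than a definitive implementation: grep_stream is a hypothetical name, it takes any iterable of already-decoded text lines (for example a file opened with io.open(path, encoding='utf-8')), and a hit that falls inside the trailing context of an earlier hit extends the current group instead of being skipped.

```python
import re
from collections import deque

def grep_stream(pattern, lines, context=0, flags=re.UNICODE):
    """Yield lists of lines (hit plus context) from an iterable of text lines."""
    regex = re.compile(pattern, flags)
    before = deque(maxlen=context)  # the last `context` lines not yet emitted
    group = None                    # group currently being collected, or None
    remaining = 0                   # trailing context lines still wanted
    for line in lines:
        line = line.rstrip('\n')
        if regex.search(line):
            if group is None:
                # Start a new group with the buffered leading context
                group = list(before)
                before.clear()
            group.append(line)
            remaining = context     # every hit resets the trailing context
        elif group is not None and remaining > 0:
            group.append(line)
            remaining -= 1
        else:
            if group is not None:
                yield group         # trailing context is complete
                group = None
            before.append(line)
    if group is not None:
        yield group                 # input ended before the trailing context was filled
```

Because it is a generator, groups are produced as the input is read, and no line appears in more than one group. Unlike grep itself, two groups that happen to be directly adjacent are still yielded separately.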