2014-06-23 166 views
-3

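Searching for multiple strings in a txt file with Python

I have run into a tricky search scenario while scanning a text file, and I would like some advice on the best way to handle it, either by breaking the problem down or with any useful modules. I have a text file like the example below, and I am looking for a text sequence such as "test1(OK) test2(OK)". If this search pattern is matched, I need to go back up the file, find the last 4 entries of another string, "String Group A", and capture the "Useful information for A" from those earlier string groups. To make things harder, I have similar 'B' information groups mixed in, and I have to do the same for all of the Group 'B' information!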

String Group A 
    Useful information for A 

String Group A 
    Useful information for A 

String Group B 
    Useful information for B 

String Group A 
    Useful information for A 

String Group B 
    Useful information for B 

String Group A 
    Useful information for A 

Other Main String for A 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

Other Main String for B 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for B」 from 「String Group B」 

String Group B 
    Useful information for B 

String Group A 
    Useful information for A 

And so on… 

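As I said, I am looking for ideas on the best way forward, because gathering information from this text file seems to involve too much jumping around. One idea I had was to treat 'String Group A' as line(x), and then, when the "test1(OK) test2(OK)" condition is met, go back to line(x), line(x-1), line(x-2) and line(x-3) and grab each "Useful information for A", but I am not convinced this is the best approach. I should point out that the text file is huge and contains thousands of String Group A and B entries.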

Thanks for reading,

MikG

+0

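Are you interested in every occurrence of 'Useful information for' that comes before 'test1(OK) test2(OK)' and 'Other Main String for A', or only in the ones immediately before it? – Praxeolitic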

+0

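Hi Praxeolitic, I am only looking for the previous 4 'String Group A' entries once 'test1(OK) test2(OK)' is met for 'A'. Likewise, I need to repeat this for the 'B' condition. – MikG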

Answers

1

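Here is how to define a circular vector class that keeps track of only the data you might need while processing the file from top to bottom. It has a fair number of comments so that it can be understood rather than being just a code dump. The details of the parsing of course depend strongly on what your input looks like. My code makes assumptions based on your example file, and you may need to change them. For example, startswith() may be too rigid depending on your input, in which case you might need to use find() instead.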

Code

from __future__ import print_function 
import sys 
from itertools import chain 

class circ_vec(object):
    """A fixed-capacity circular vector that keeps the most recent elements."""
    # The use of slots drastically reduces the memory footprint of Python
    # classes: it removes the need for a per-instance attribute dict.
    __slots__ = ['end', 'elems', 'capacity']
    # end keeps track of where the next element is to be added
    # elems holds the last `capacity` elements that were added
    # capacity is how many elements we will hold

    def __init__(self, capacity):
        # We only need to specify the capacity up front; elems starts empty.
        self.end = 0
        self.elems = []
        self.capacity = capacity

    def add(self, e):
        new_index = self.end
        if new_index < len(self.elems):
            # The vector is full: overwrite the oldest element.
            self.elems[new_index] = e
        else:
            # If we haven't seen capacity # of elements yet just append
            self.elems.append(e)
        self.end = (self.end + 1) % self.capacity

    def __len__(self):
        return len(self.elems)

    # This magic method allows bracket [ ] indexing; index 0 is the oldest element
    def __getitem__(self, index):
        if not 0 <= index < len(self.elems):
            raise IndexError('circ_vec index out of range')
        # Until the vector is full the oldest element sits at position 0;
        # afterwards it sits at self.end, the slot that is overwritten next.
        first = self.end if len(self.elems) == self.capacity else 0
        return self.elems[(first + index) % len(self.elems)]

    # This magic method allows iteration
    def __iter__(self):
        if not self.elems:
            return iter([])
        elif len(self.elems) < self.capacity:
            first = 0
        else:
            first = self.end
        # Iterate from the oldest element to the newest
        return chain(iter(self.elems[first:]), iter(self.elems[:first]))

# One circular vector per identifier, each holding the last 4 groups of lines.
string_group_last_four = {k: circ_vec(4) for k in ['A', 'B']}
with open(sys.argv[1], 'r') as f:
    string_group_context = None
    # We will manually iterate through the file. Get an iterator using iter().
    it = iter(f)
    # As per the example, the file we're reading groups lines in twos.
    buf = circ_vec(2)
    try:
        while True:
            line = next(it)
            buf.add(line.strip())
            # The lines beginning with 'String Group' should be recorded in
            # case we need them later.
            if line.startswith('String Group'):
                # Here is the benefit of manual iteration. We can call next()
                # more than once per loop iteration. Sometimes once we've read
                # a line, we just want to immediately get the next line.
                # strip() removes whitespace and the newline characters
                buf.add(next(it).strip())
                # How exactly you will parse your lines depends on your needs.
                # Here, I assume that the last word in the current line is an
                # identifier that we are interested in.
                string_group = line.strip().split()[-1]
                # Add the lines in the buffer to the circular vector belonging
                # to the identifier.
                string_group_last_four[string_group].add(list(l for l in buf))
                buf = circ_vec(2)
            # For lines beginning with 'Other Main String for', we need to
            # remember the identifier but there's no other information to
            # record.
            elif line.startswith('Other Main String for'):
                string_group_context = line.strip().split()[-1]
            # Use find() instead of startswith() because the
            # 'test1(OK) test2(OK)' lines begin with whitespace. startswith()
            # would depend on the specific whitespace characters, which could
            # be confusing.
            elif line.find('test1(OK) test2(OK)') != -1:
                print('String group' + string_group_context + ' has a test hit!')
                # Print out the test lines.
                for l in buf:
                    print(l)
                print('Four most recent "String Group ' + string_group_context + '" lines:')
                # Use the identifier dict to get the last 4 relevant groups of lines
                for cv in string_group_last_four[string_group_context]:
                    for l in cv:
                        print(l)
    # Manual iteration is terminated by an exception in Python. Catch and swallow it.
    except StopIteration:
        pass
print("Done!")
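
For comparison, the same "remember only the last 4 entries per group" idea can be sketched with collections.deque from the standard library, whose maxlen argument discards the oldest item automatically. This is only a sketch and not the code from the answer above: the function name collect_hits, the history parameter and the sample lines are made up for illustration, and it assumes, like the answer, that the information line immediately follows its 'String Group' line and that the identifier is the last word of the line.

```python
from collections import deque


def collect_hits(lines, history=4):
    """Yield (group_id, recent_entries) for every 'test1(OK) test2(OK)' line.

    Hypothetical helper: recent_entries is a list of (header, info) pairs,
    oldest first, holding at most `history` entries for that group.
    """
    recent = {}      # group id -> deque of the last `history` (header, info) pairs
    context = None   # id taken from the latest 'Other Main String for' line
    it = iter(lines)
    for line in it:
        text = line.strip()
        if text.startswith('String Group'):
            group_id = text.split()[-1]
            # Assumption: the information line directly follows the header.
            info = next(it, '').strip()
            recent.setdefault(group_id, deque(maxlen=history)).append((text, info))
        elif text.startswith('Other Main String for'):
            context = text.split()[-1]
        elif 'test1(OK) test2(OK)' in text and context is not None:
            yield context, list(recent.get(context, ()))


# Made-up sample data in the same layout as the question's file.
sample = """\
String Group A
    info A1
String Group B
    info B1
String Group A
    info A2
Other Main String for A
    test1(OK) test2(OK) *** Condition Met ***
""".splitlines()

for group_id, entries in collect_hits(sample):
    print(group_id, [info for _header, info in entries])
# prints: A ['info A1', 'info A2']
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, so a huge file is still read one line at a time.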

Contents of the test file. I tried to make it a little odd in order to exercise the code a bit.

Other Main String for A 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

String Group 1 A 
    Useful information for A 

String Group 2 A 
    Useful information for A 

Other Main String for A 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

String Group 1 B 
    Useful information for B 

String Group 3 A 
    Useful information for A 

String Group 2 B 
    Useful information for B 

String Group 4 A 
    Useful information for A 

String Group 5 A 
    Useful information for A 

String Group 6 A 
    Useful information for A 

String Group 3 B 
    Useful information for B 

Other Main String for A 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

Other Main String for B 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

Other Main String for B 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

String Group 4 B 
    Useful information for B 

Other Main String for B 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

String Group 7 A 
    Useful information for A 

Other Main String for A 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

Output

String groupA has a test hit! 
Other Main String for A 
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 
Four most recent "String Group A" lines: 
String groupA has a test hit! 
Other Main String for A 
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 
Four most recent "String Group A" lines: 
String Group 1 A 
Useful information for A 
String Group 2 A 
Useful information for A 
String groupA has a test hit! 
Other Main String for A 
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 
Four most recent "String Group A" lines: 
String Group 3 A 
Useful information for A 
String Group 4 A 
Useful information for A 
String Group 5 A 
Useful information for A 
String Group 6 A 
Useful information for A 
String groupB has a test hit! 
Other Main String for B 
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 
Four most recent "String Group B" lines: 
String Group 1 B 
Useful information for B 
String Group 2 B 
Useful information for B 
String Group 3 B 
Useful information for B 
String groupB has a test hit! 
Other Main String for B 
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 
Four most recent "String Group B" lines: 
String Group 1 B 
Useful information for B 
String Group 2 B 
Useful information for B 
String Group 3 B 
Useful information for B 
String groupB has a test hit! 
Other Main String for B 
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 
Four most recent "String Group B" lines: 
String Group 1 B 
Useful information for B 
String Group 2 B 
Useful information for B 
String Group 3 B 
Useful information for B 
String Group 4 B 
Useful information for B 
String groupA has a test hit! 
Other Main String for A 
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 
Four most recent "String Group A" lines: 
String Group 4 A 
Useful information for A 
String Group 5 A 
Useful information for A 
String Group 6 A 
Useful information for A 
String Group 7 A 
Useful information for A 
Done! 
+0

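Hi Praxeolitic, thanks for your comprehensive reply. Unfortunately, I think I am out of my depth with this type of text processing. I can see that your script works with my example file, but when I change the strings to match my original text file I do not see the previous 4 lines printed out, even when using line.find() in place of .startswith(). – MikG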

+0

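Keep in mind that find() does not return a boolean. It returns -1 if there is no match, and otherwise the starting index of the match. The code in my answer is an example. The Python documentation is good on all of this. https://docs.python.org/2/library/string.html#string.find – Praxeolitic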
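
A small illustration of that point, using a made-up line: a match at the very start of the line gives index 0, which is falsy, while "not found" gives -1, which is truthy, so a bare `if line.find(...)` tests the wrong thing. Comparing against -1 or using the `in` operator avoids this.

```python
line = 'test1(OK) test2(OK) *** Condition Met ***'
pattern = 'test1(OK) test2(OK)'

at_start = line.find(pattern)         # 0: found at the start, but falsy
missing = line.find('no such text')   # -1: not found, but truthy

# Two reliable ways to ask "does the line contain the pattern?"
found_explicit = line.find(pattern) != -1
found_in = pattern in line

print(at_start, missing, found_explicit, found_in)
# prints: 0 -1 True True
```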

+0

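Perfect - I have accepted your answer! With your suggestions I managed to capture the blocks of text I was after. You have been very helpful, I am very grateful, and somewhat humbled! If I wanted to extract particular text (say, lines 4 and 5 after the "String Group A" marker), would I just enumerate the captured lines, then count and split the lines? – MikG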

1

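The problem, as I understand it, is to find every occurrence of a particular pattern and extract a piece of text that follows it. The find_all() routine below returns the start index of every occurrence of a pattern (sub) in a string. The example after it shows how to use it to get the test results, but it depends on finding a subsequent end pattern.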

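def find_all(s, sub):
    # Return the start index of every non-overlapping occurrence of sub in s.
    indxs = []
    start = 0
    ns = len(s)
    nsub = len(sub)
    while True:
        indx = s.find(sub, start, ns)
        if indx < 0:
            break
        indxs.append(indx)
        # Continue searching just past the match we found.
        start = indx + nsub
    return indxs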

A sketch of its use, given a string (test_results), the String Group A pattern (group_A_pattern) and the pattern that ends the "Useful information for A" (end_group_pattern):

def get_test_results(test_results, group_A_pattern, end_group_pattern):
    starts = find_all(test_results, group_A_pattern)
    useful_A = []
    # Only the last 4 occurrences of the group pattern are of interest.
    for start0 in starts[-4:]:
        start = start0 + len(group_A_pattern)
        # Assumes end_group_pattern occurs after each start; find() returns -1 otherwise.
        stop = test_results.find(end_group_pattern, start)
        useful_A.append(test_results[start:stop])
    return useful_A

Here is the test code:

test_results = 'groupA some-useful end junk groupA more-useful end whatever'
group_A_pattern = 'groupA'
end_group_pattern = 'end'
print(get_test_results(test_results, group_A_pattern, end_group_pattern))

Running the test code above produces:

[' some-useful ', ' more-useful '] 
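
As a possible alternative, the same start/end extraction can be sketched with the standard re module. This is a swapped-in technique rather than the approach used in this answer, and the name get_last_sections and its count parameter are made up for illustration. Both patterns are treated as literal text, and unlike the find()-based version a start pattern with no later end pattern is simply not matched.

```python
import re


def get_last_sections(text, start_pattern, end_pattern, count=4):
    """Return the text between the last `count` start/end pattern pairs."""
    # re.escape makes both patterns literal; the non-greedy (.*?) stops at
    # the first end_pattern that follows each start_pattern.
    regex = re.compile(
        re.escape(start_pattern) + r'(.*?)' + re.escape(end_pattern),
        re.DOTALL,
    )
    matches = regex.findall(text)
    return matches[-count:] if count > 0 else []


test_results = 'groupA some-useful end junk groupA more-useful end whatever'
print(get_last_sections(test_results, 'groupA', 'end'))
# prints: [' some-useful ', ' more-useful ']
```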
+0

Thanks for your answer @xxyzzy. In the end I decided to go with Praxeolitic's answer, although I think I am a bit out of my depth with both of your answers! – MikG
