Searching for multiple strings in a txt file using Python


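I've run into a tricky search scenario while scanning a text file, and I'd appreciate advice on the best way to handle it, whether by breaking the problem down or by using any helpful modules. In the example text file below, I'm looking for a text sequence like "test1(OK) test2(OK)". If that search pattern is matched, I need to go back through the file, find the last 4 entries of another string, "String Group A", and capture the "Useful information for A" from those preceding string groups. To make things harder, I have similar 'B' groups of information, and I have to do the same for all of the Group 'B' information!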

String Group A 
    Useful information for A 

String Group A 
    Useful information for A 

String Group B 
    Useful information for B 

String Group A 
    Useful information for A 

String Group B 
    Useful information for B 

String Group A 
    Useful information for A 

Other Main String for A 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

Other Main String for B 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for B」 from 「String Group B」 

String Group B 
    Useful information for B 

String Group A 
    Useful information for A 

And so on…

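As I said, I'm looking for ideas on the best way forward, since collecting information from this text file seems to involve too much jumping around. One idea I had was to treat 'String Group A' as line(x) and then, when the "test1(OK) test2(OK)" condition is met, go back to line(x), line(x-1), line(x-2) and line(x-3) and grab each "Useful information for A", but I'm not convinced that's the best approach. I should point out that the text file is huge, containing thousands of String Group A and B entries.

Thanks for reading,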

MikG

Source

2014-06-23 MikG

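Are you interested in every occurrence of 'Useful information for A' that comes before 'test1(OK) test2(OK)' and 'Other Main String for A', or only the ones immediately before it? – Praxeolitic

Hi Praxeolitic, I'm only looking for the previous 4 entries of 'String Group A' once 'test1(OK) test2(OK)' is hit for 'A', and likewise I need to repeat this for the 'B' condition. – MikG

Here is how to define a circular vector class that keeps track of only the data you might need as you process the file from top to bottom. It has a fair number of comments so that it can be understood rather than just being a code dump. The parsing details of course depend heavily on what your input looks like. My code makes assumptions based on your example file that you may need to change. For example, using startswith() might be too rigid depending on your input, and you may need to use find() instead.

Code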

from __future__ import print_function 
import sys 
from itertools import chain 

class circ_vec(object):
    """A fixed-capacity circular vector that keeps only the most recent elements."""
    # The use of slots drastically reduces the memory footprint of Python classes:
    # it removes the need for a per-object attribute dictionary.
    __slots__ = ['end', 'elems', 'capacity']
    # end keeps track of where the next element is to be added
    # elems holds the last X elements that were added
    # capacity is how many elements we will hold

    def __init__(self, capacity):
        # We only need to specify the capacity up front; elems starts empty.
        self.end = 0
        self.elems = []
        self.capacity = capacity

    def add(self, e):
        new_index = self.end
        if new_index < len(self.elems):
            # The vector is full, so overwrite the oldest element.
            self.elems[new_index] = e
        else:
            # If we haven't seen capacity # of elements yet, just append.
            self.elems.append(e)
        self.end = (self.end + 1) % self.capacity

    def __len__(self):
        return len(self.elems)

    # This magic method allows bracket [ ] indexing; index 0 is the oldest element.
    def __getitem__(self, index):
        if not 0 <= index < len(self.elems):
            raise IndexError(index)
        # Once the vector is full the oldest element sits at `end`; before that it is at 0.
        first = self.end if len(self.elems) == self.capacity else 0
        return self.elems[(first + index) % self.capacity]

    # This magic method allows iteration.
    def __iter__(self):
        if not self.elems:
            return iter([])
        elif len(self.elems) < self.capacity:
            first = 0
        else:
            first = self.end
        # Iterate from the oldest element to the newest.
        return chain(iter(self.elems[first:]), iter(self.elems[:first]))

string_group_last_four = {k: circ_vec(4) for k in ['A', 'B']}
with open(sys.argv[1], 'r') as f:
    string_group_context = None
    # We will manually iterate through the file. Get an iterator using iter().
    it = iter(f)
    # As per the example, the file we're reading groups lines in twos.
    buf = circ_vec(2)
    try:
        while True:
            line = next(it)
            buf.add(line.strip())
            # The lines beginning with 'String Group' should be recorded in case we need them later.
            if line.startswith('String Group'):
                # Here is the benefit of manual iteration. We can call next() more than once per loop iteration.
                # Sometimes once we've read a line, we just want to immediately get the next line.
                # strip() removes whitespace and the newline characters.
                buf.add(next(it).strip())
                # How exactly you will parse your lines depends on your needs. Here, I assume that the
                # last word in the current line is an identifier that we are interested in.
                string_group = line.strip().split()[-1]
                # Add the lines in the buffer to the circular vector belonging to the identifier.
                string_group_last_four[string_group].add(list(l for l in buf))
                buf = circ_vec(2)
            # For lines beginning with 'Other Main String for', we need to
            # remember the identifier but there's no other information to
            # record.
            elif line.startswith('Other Main String for'):
                string_group_context = line.strip().split()[-1]
            # Use find() instead of startswith() because the
            # 'test1(OK) test2(OK)' lines begin with whitespace. startswith()
            # would depend on the specific whitespace characters, which could
            # be confusing.
            elif line.find('test1(OK) test2(OK)') != -1:
                print('String group' + string_group_context + ' has a test hit!')
                # Print out the test lines.
                for l in buf:
                    print(l)
                print('Four most recent "String Group ' + string_group_context + '" lines:')
                # Use the identifier dict to get the last 4 relevant groups of lines.
                for cv in string_group_last_four[string_group_context]:
                    for l in cv:
                        print(l)
    # Manual iteration is terminated by an exception in Python. Catch and swallow it.
    except StopIteration:
        pass
print("Done!")

Contents of the test file. I tried to make it a little odd in order to exercise the code.

Other Main String for A 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

String Group 1 A 
    Useful information for A 

String Group 2 A 
    Useful information for A 

Other Main String for A 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

String Group 1 B 
    Useful information for B 

String Group 3 A 
    Useful information for A 

String Group 2 B 
    Useful information for B 

String Group 4 A 
    Useful information for A 

String Group 5 A 
    Useful information for A 

String Group 6 A 
    Useful information for A 

String Group 3 B 
    Useful information for B 

Other Main String for A 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

Other Main String for B 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

Other Main String for B 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

String Group 4 B 
    Useful information for B 

Other Main String for B 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 

String Group 7 A 
    Useful information for A 

Other Main String for A 
    test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」

Output

String groupA has a test hit! 
Other Main String for A 
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 
Four most recent "String Group A" lines: 
String groupA has a test hit! 
Other Main String for A 
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 
Four most recent "String Group A" lines: 
String Group 1 A 
Useful information for A 
String Group 2 A 
Useful information for A 
String groupA has a test hit! 
Other Main String for A 
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 
Four most recent "String Group A" lines: 
String Group 3 A 
Useful information for A 
String Group 4 A 
Useful information for A 
String Group 5 A 
Useful information for A 
String Group 6 A 
Useful information for A 
String groupB has a test hit! 
Other Main String for B 
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 
Four most recent "String Group B" lines: 
String Group 1 B 
Useful information for B 
String Group 2 B 
Useful information for B 
String Group 3 B 
Useful information for B 
String groupB has a test hit! 
Other Main String for B 
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 
Four most recent "String Group B" lines: 
String Group 1 B 
Useful information for B 
String Group 2 B 
Useful information for B 
String Group 3 B 
Useful information for B 
String groupB has a test hit! 
Other Main String for B 
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 
Four most recent "String Group B" lines: 
String Group 1 B 
Useful information for B 
String Group 2 B 
Useful information for B 
String Group 3 B 
Useful information for B 
String Group 4 B 
Useful information for B 
String groupA has a test hit! 
Other Main String for A 
test1(OK) test2(OK) *** Condition Met *** #Now go back and collect the last 4 entries of 「Useful information for A」 from 「String Group A」 
Four most recent "String Group A" lines: 
String Group 4 A 
Useful information for A 
String Group 5 A 
Useful information for A 
String Group 6 A 
Useful information for A 
String Group 7 A 
Useful information for A 
Done!
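
As a rough alternative to the hand-written circular vector, the same "remember only the last 4 entries per group" idea can be sketched with collections.deque and its maxlen argument from the standard library. This is not the code from the answer above: the function name collect_hits and its keep parameter are made up for illustration, and it assumes, like the answer, that the group identifier is the last word of the 'String Group' and 'Other Main String for' lines and that the useful information is on the line right after each 'String Group' line.

```python
from collections import defaultdict, deque


def collect_hits(lines, keep=4):
    """Return (group, recent_entries) for every 'test1(OK) test2(OK)' hit.

    Hypothetical helper: `lines` is any iterable of text lines, for
    example an open file object.
    """
    # One bounded deque per group; old entries fall off automatically.
    recent = defaultdict(lambda: deque(maxlen=keep))
    context = None  # group named by the latest 'Other Main String for' line
    hits = []
    it = iter(lines)
    for line in it:
        text = line.strip()
        if text.startswith('String Group'):
            group = text.split()[-1]
            # Assumption: the next line holds the useful information.
            info = next(it, '').strip()
            recent[group].append((text, info))
        elif text.startswith('Other Main String for'):
            context = text.split()[-1]
        elif 'test1(OK) test2(OK)' in text and context is not None:
            # Copy the deque so later appends do not change this hit.
            hits.append((context, list(recent[context])))
    return hits
```

Something like `with open(path) as f: hits = collect_hits(f)` would then read the file one line at a time, so memory use stays small even for a very large file.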

Source

2014-06-23 12:08:19 Praxeolitic

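Hi Praxeolitic, thanks for the comprehensive reply. Unfortunately I think I'm out of my depth with this type of text processing. I can see that your script works on my example file, but when I change the strings to match my original text file I don't see the previous 4 lines printed out, even when using line.find() in place of startswith(). – MikG

Keep in mind that find() does not return a boolean. It returns -1 if there is no match, and otherwise the starting index of the match. The code in my answer is an example. The Python docs are good on all of this. https://docs.python.org/2/library/string.html#string.find – Praxeolitic

Perfect - I've accepted your answer! With your advice I managed to capture the blocks of text I was after. You've been very helpful, I'm very grateful, and a little overwhelmed! If I wanted to extract particular text (say, lines 4 and 5 after the "String Group A" marker), would I just enumerate the captured lines and then count and split the lines? – MikG

The problem, as I understand it, is to find every occurrence of a particular pattern and extract a piece of text after each one. The find_all() routine below returns the start index of every occurrence of a pattern (sub) in a string (s). The example that follows shows how to use it to get the test results, but it depends on finding a subsequent end_pattern.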
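
To illustrate the point about find(), here is a small sketch. The two helper names are invented for this example; only str.find() and the `in` operator are real Python.

```python
def has_marker_wrong(text, marker):
    # Buggy on purpose: find() returns an index, not a boolean.
    # A match at position 0 is falsy, and a miss (-1) is truthy.
    return bool(text.find(marker))


def has_marker(text, marker):
    # Correct: compare against -1 (equivalent to `marker in text`).
    return text.find(marker) != -1


line = '    test1(OK) test2(OK) *** Condition Met ***'
print(has_marker(line, 'test1(OK) test2(OK)'))        # True
print(has_marker_wrong('no match here', 'test1(OK)'))  # True, which is wrong
```

If a condition like `if line.find(...)` was used when adapting the script, this is one possible reason for missing or unexpected hits.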

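def find_all(s, sub):
    # Collect the start index of every non-overlapping occurrence of sub in s.
    indxs = []
    start = 0
    ns = len(s)
    nsub = len(sub)
    while True:
        indx = s.find(sub, start, ns)
        if indx < 0:
            break
        indxs.append(indx)
        # Continue searching just past the end of this match.
        start = indx + nsub
    return indxs

A sketch of its use, given the string (test_results), the group A pattern (group_A_pattern) and the pattern that marks the end of the "Useful information for A" (end_group_pattern):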

def get_test_results(test_results, group_A_pattern, end_group_pattern):
    starts = find_all(test_results, group_A_pattern)
    useful_A = []
    for start0 in starts[-4:]:
        start = start0 + len(group_A_pattern)
        stop = test_results.find(end_group_pattern, start)
        useful_A.append(test_results[start:stop])
    return useful_A

Here is the test code:

test_results = 'groupA some-useful end junk groupA more-useful end whatever'
group_A_pattern = 'groupA'
end_group_pattern = 'end'
print(get_test_results(test_results, group_A_pattern, end_group_pattern))

Running the test code above produces:

[' some-useful ', ' more-useful ']
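
For comparison, roughly the same extraction can be sketched with the standard re module instead of find_all(). This is a swapped-in technique, not part of the answer above, and the function name last_useful_sections and its count parameter are hypothetical. One difference to be aware of: it keeps the last 4 complete start/end pairs, whereas get_test_results() takes the last 4 start positions even if one of them has no end pattern after it.

```python
import re


def last_useful_sections(text, start_pattern, end_pattern, count=4):
    """Return the text between the last `count` start/end pattern pairs."""
    # re.escape() makes the patterns match literally; (.*?) is a
    # non-greedy capture so each match stops at the nearest end pattern.
    regex = re.compile(
        re.escape(start_pattern) + r'(.*?)' + re.escape(end_pattern),
        re.DOTALL,
    )
    return regex.findall(text)[-count:]


test_results = 'groupA some-useful end junk groupA more-useful end whatever'
print(last_useful_sections(test_results, 'groupA', 'end'))
```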

Source

2014-06-23 11:04:43 xxyzzy

Thanks for your answer @xxyzzy. In the end I decided to go with Praxeolitic's answer, although I think I'm a little out of my depth with both of your answers! – MikG
