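How to extract parts of a string from a text file using prefixes

I have a text file in which some regions contain strings like the one below. How can I extract parts of such a string from the file using prefixes?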

20170818_141903 Test ! Vdd 3.000000; P: 20.000000;T 20.282000;Part: 0; Baud Rate: 9620.009620; Message: MMS111111110001110100000000000100100000000000000000000000000100010000000000000000000001000000000010000000000001000000100000000010000011000000000000000000000000000000000000000000000000000000000000000000000000000000000000011001001001110001010001000000000111011011001010110000000000000010000001101100000000000000000000011011111010000100111101000000000111111110000111110010110000000010001001101110000101000000000000110010010000000000000000000000000000000000001000000000000000001000000000010000001000000000000000000000000000000000000000000100010000000000000101010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010100101111111010111000000110100000000101000110000100010101010011010000000000000100010001100000000110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000SS

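Unfortunately, it is not comma- or tab-delimited; every row is one long string.

I have read in the whole file and am trying to extract everything that is binary data.

That means I want everything between the following markers: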

MMS ...... SS

I would also like to extract, for example, the value after P: or after Vdd from regions like this:

Vdd 3.000000; P: 20.000000...........................etc

This is what I have so far:

import re

match = re.search(r'P: (\w+)', LONG_STRING)
if match:
    print(match.group(1))

But this does not extract the full floating-point number; it drops everything from the decimal point onward.

Source

2017-08-23 cc6g11

Answer v2.0. Overall, this code is quite rigid and not the clearest, but right now I cannot offer a better solution for the example you provided.

>>> import re 

>>> that_long_row = "20170818_141903 Test ! Vdd 3.000$000; P: 20.000000;T 20.282000;Part: 0; Baud Rate: 9620.009620; Message: MMS11111111000111010000000000010010000000000000000000000000010001000000000000000000000100000000001000000000000100000010000000001000001100000000000000000000000000000000000000000000000000000000000000000000000000000000000001100100100111000101000100000000011101101100101011000000000000001000000110110000000000000000000001101111101000010011110100000000011111111000011111001011000000001000100110111000010100000000000011001001000000000000000000000000000000000000100000000000000000100000000001000000100000000000000000000000000000000000000000010001000000000000010101000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001010010111111101011100000011010000000010100011000010001010101001101000000000000010001000110000000011000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000SS " 

>>> regex = (r'^'                     # start of the string
...          r'.+'                    # skip any characters
...          r'Vdd '                  # until "Vdd " is reached
...          r'(?P<Vdd>[0-9\.]+)'     # capture a continuous run of digits and dots following that word as group "Vdd"
...          r'.+'                    # again, skip some more characters
...          r'P: '                   # find the "P: " prefix
...          r'(?P<P>[0-9\.]+)'       # capture a continuous run of digits and dots as group "P"
...          r'.+'                    # likewise, skip ahead to the binary "Message" between "MMS" and "SS"
...          r'MMS'
...          r'(?P<Message>[0-1]+)'   # except that this group only matches 0 and 1
...          r'SS'
...          r'.+'                    # as @Evan mentioned, this consumes trailing characters (at least one is required)
...          r'$'                     # end of the string
...          )

# the same but in a compact form: 
>>> regex = r'^.+Vdd (?P<Vdd>[0-9\.]+).+P: (?P<P>[0-9\.]+).+MMS(?P<Message>[0-1]+)SS.+$' 

>>> match = re.match(regex, that_long_row) 

# matching will form a groupdict that is like a normal dict 
# and you can access any matched group value by its name 

>>> match.groupdict() 
{'Vdd': '3.000', 'P': '20.000000', 'Message': ...
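
A possibly less rigid variant, offered only as a sketch: search for each field on its own instead of matching the whole row with one pattern. The helper name `extract_fields` and the shortened sample row are made up for illustration. The patterns assume that each number is written as digits with an optional fractional part, and that the message contains only 0 and 1 between `MMS` and `SS`.

```python
import re

# Each field gets its own small pattern, so the field order and any
# trailing characters no longer matter.
FLOAT = r'([0-9]+(?:\.[0-9]+)?)'   # digits, optionally followed by ".digits"
VDD_RE = re.compile(r'Vdd\s+' + FLOAT)
P_RE = re.compile(r'\bP:\s*' + FLOAT)
MSG_RE = re.compile(r'MMS([01]+)SS')


def extract_fields(row):
    """Return a dict with Vdd, P and Message; a missing field becomes None."""
    def first_group(pattern):
        m = pattern.search(row)
        return m.group(1) if m else None

    vdd = first_group(VDD_RE)
    p = first_group(P_RE)
    return {
        'Vdd': float(vdd) if vdd is not None else None,
        'P': float(p) if p is not None else None,
        'Message': first_group(MSG_RE),
    }


# Shortened, made-up sample row in the same layout as the data in the question.
sample = ('20170818_141903 Test ! Vdd 3.000000; P: 20.000000;T 20.282000;'
          'Part: 0; Baud Rate: 9620.009620; Message: MMS1111000011SS ')
print(extract_fields(sample))
```

Because every field is looked up separately, a row that lacks one field still yields the others, at the cost of scanning the row three times.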

Next, if you want to parse a file this way, I would create a simple class to handle all the data, type conversion, validation and so on.

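class Message:
    def __init__(self, Vdd, P, Message):
        # float() raises ValueError if the captured text is not a valid number
        self.vdd = float(Vdd)
        self.p = float(P)
        self.text = Message

data = []

with open('yourfile', 'r') as f:
    for line in f:
        match = re.match(regex, line)
        try:
            # match is None when the line does not fit the pattern,
            # in which case match.groupdict() raises AttributeError
            data.append(Message(**match.groupdict()))
        except (AttributeError, ValueError):
            data.append('CORRUPTED')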

And so on.
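
The loop above could also be condensed into a list comprehension. The following is only a sketch: `ROW_RE`, `parse_line` and the sample lines are hypothetical names and data, and the pattern uses `.*` instead of `.+` on the assumption that some rows may have nothing after `SS`.

```python
import re

# Same named groups as the pattern above, but with ".*" so that a row
# without trailing characters still matches (an assumption, see lead-in).
ROW_RE = re.compile(
    r'^.*Vdd (?P<Vdd>[0-9.]+).*P: (?P<P>[0-9.]+).*MMS(?P<Message>[01]+)SS.*$'
)


def parse_line(line):
    """Return a dict of converted values, or 'CORRUPTED' if parsing fails."""
    match = ROW_RE.match(line)
    if match is None:
        return 'CORRUPTED'
    try:
        return {
            'vdd': float(match.group('Vdd')),
            'p': float(match.group('P')),
            'text': match.group('Message'),
        }
    except ValueError:
        # [0-9.]+ also accepts text such as "3.0.0", which float() rejects
        return 'CORRUPTED'


# Made-up lines standing in for the contents of the file.
lines = [
    'x Vdd 3.000000; P: 20.000000;T 20.282000; Message: MMS0101SS \n',
    'x Vdd 3.0.0; P: 20.000000; Message: MMS0101SS\n',
    'no fields here\n',
]
data = [parse_line(line) for line in lines]
print(data)
```

To run this against a real file, `lines` would be replaced by the open file object, since iterating over a file yields one line at a time.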

Source

2017-08-23 12:54:30

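The string he gave has a trailing space at the end, so if you want to anchor the pattern with $, you may want to account for that in the regex. Also, given that you took the time to write such an awesome regex, a nice list comprehension to collect all of the results might be useful. I don't want to post another answer and have him credit me when you did all the work. – Evan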

How do I find out what all these regex parameters mean? They look scary. – cc6g11

@cc6g11, they certainly do! I have tried to make my answer clearer. Hope this helps! –
