2015-10-17 46 views
1

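Python - Efficiently searching a file's lines for multiple patterns

I'm parsing a large number of large files and want to make sure I do it as efficiently as possible. One of the lines I parse looks like this (Windows Security event log, event ID 4624):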

Security/Microsoft-Windows-Security-Auditing ID [4624] :EventData/Data -> SubjectUserSid = S-1-0-0 SubjectUserName = - SubjectDomainName = - SubjectLogonId = 0x0000000000000000 TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 TargetUserName = johndoe TargetDomainName = TestDomain TargetLogonId = 0x0000000001111111 LogonType = 3 LogonProcessName = NtLmSsp AuthenticationPackageName = NTLM WorkstationName = TestWorkstation LogonGuid = {00000000-0000-0000-0000-000000000000} TransmittedServices = - LmPackageName = NTLM V2 KeyLength = 128 ProcessId = 0x0000000000000000 ProcessName = - IpAddress = 1.1.1.1 IpPort = 11111 

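What I'd like to know is: what is the most efficient way to pull multiple fields out of the line? I could split the line repeatedly until I reach each field I'm interested in, but scanning the line over and over feels like a waste of time and resources.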

Is there a smart way to go through the line only once but pull out, for example, the following fields:

LogonType = 3 
TargetUserName = johndoe 
TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 

For example, one thing I could do is repeat the following process:

part = line.partition('TargetUserName = ')[2] 
username = part.partition(' ')[0] 

for each field I want (the example above gets me just the username), but again, that feels inefficient to me.

Is there a better way to handle this?

Have you looked at [regexes](https://docs.python.org/2/howto/regex.html)? –

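Yes, I've used regular expressions in the past. Is there a way to match multiple patterns in a single regex operation? Or do I need a separate re.match() or search() for each pattern I'm interested in? Thanks! – DJMcCarthy12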

Take a look at [the '.group()' method of regex match objects](https://docs.python.org/2/library/re.html#re.MatchObject.group). –

Answers

2

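Each field name is a run of uppercase and lowercase letters. It is separated from its value by " = ". Each value is a run of non-whitespace characters. You can use re.findall with capturing groups to find every "letters = nonwhitespace" instance. That gives you a list of tuples, which you can save, or iterate over and pass to a format string: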

>>> s = '''Security/Microsoft-Windows-Security-Auditing ID [4624] :EventData/Data -> SubjectUserSid = S-1-0-0 SubjectUserName = - SubjectDomainName = - SubjectLogonId = 0x0000000000000000 TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 TargetUserName = johndoe TargetDomainName = TestDomain TargetLogonId = 0x0000000001111111 LogonType = 3 LogonProcessName = NtLmSsp AuthenticationPackageName = NTLM WorkstationName = TestWorkstation LogonGuid = {00000000-0000-0000-0000-000000000000} TransmittedServices = - LmPackageName = NTLM V2 KeyLength = 128 ProcessId = 0x0000000000000000 ProcessName = - IpAddress = 1.1.1.1 IpPort = 11111 ''' 
>>> import re 
>>> for item in re.findall(r'([A-Za-z]+) = (\S+)', s): 
...  print('{} = {}'.format(*item)) 
... 
SubjectUserSid = S-1-0-0 
SubjectUserName = - 
SubjectDomainName = - 
SubjectLogonId = 0x0000000000000000 
TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 
TargetUserName = johndoe 
TargetDomainName = TestDomain 
TargetLogonId = 0x0000000001111111 
LogonType = 3 
LogonProcessName = NtLmSsp 
AuthenticationPackageName = NTLM 
WorkstationName = TestWorkstation 
LogonGuid = {00000000-0000-0000-0000-000000000000} 
TransmittedServices = - 
LmPackageName = NTLM 
KeyLength = 128 
ProcessId = 0x0000000000000000 
ProcessName = - 
IpAddress = 1.1.1.1 
IpPort = 11111 

You can also turn it into a dictionary for convenience:

>>> d = dict(re.findall(r'([A-Za-z]+) = (\S+)', s)) 
>>> d['LogonType'] 
'3' 
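
One caveat visible in the output above: `\S+` stops at the first space, so `LmPackageName = NTLM V2` comes back as `NTLM`. If multi-word values matter, one possible variation (not part of the original answer) is to let the value run lazily up to the next "name = " pair or the end of the line. This is a minimal sketch; the name `FIELD_RE` and the shortened sample string are assumptions for illustration, and it presumes a value never itself contains text shaped like ` Word = `.

```python
import re

# Shortened sample line (assumption: same "Name = value" layout as the real log).
s = 'LogonType = 3 LmPackageName = NTLM V2 KeyLength = 128 IpPort = 11111 '

# The value is matched lazily and ends where the lookahead sees either
# the next " Name = " pair or optional trailing whitespace at end of line.
FIELD_RE = re.compile(r'([A-Za-z]+) = (.*?)(?= [A-Za-z]+ = |\s*$)')

d = dict(FIELD_RE.findall(s))
print(d['LmPackageName'])
print(d['LogonType'])
```

The lookahead does not consume any text, so the next field name is still available as the start of the following match.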

This is great. One question: is there a way to do this but apply the .group() logic so that it only grabs the 'nonwhitespace' part after the '='? Thanks! – DJMcCarthy12

If you've already turned it into a dictionary (which I'd recommend), you can get the values with 'd.values()'. – TigerhawkT3

But if you really don't want to capture the field names, you can simply remove the parentheses around '[A-Za-z]+'. – TigerhawkT3
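
To illustrate the two suggestions in these comments: when the pattern has a single capturing group, `re.findall` returns a flat list of strings rather than a list of tuples. A small sketch on a shortened sample line (the string `s` here is an assumption for brevity, not the full log line):

```python
import re

s = 'TargetUserName = johndoe LogonType = 3 IpPort = 11111'

# Only the value is captured, so findall returns just the values.
values = re.findall(r'[A-Za-z]+ = (\S+)', s)
print(values)

# The dictionary route gives the same values
# (dicts preserve insertion order on Python 3.7+).
d = dict(re.findall(r'([A-Za-z]+) = (\S+)', s))
print(list(d.values()))
```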

1
st = 'Security/Microsoft-Windows-Security-Auditing ID [4624] :EventData/Data -> SubjectUserSid = S-1-0-0 SubjectUserName = - SubjectDomainName = - SubjectLogonId = 0x0000000000000000 TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 TargetUserName = johndoe TargetDomainName = TestDomain TargetLogonId = 0x0000000001111111 LogonType = 3 LogonProcessName = NtLmSsp AuthenticationPackageName = NTLM WorkstationName = TestWorkstation LogonGuid = {00000000-0000-0000-0000-000000000000} TransmittedServices = - LmPackageName = NTLM V2 KeyLength = 128 ProcessId = 0x0000000000000000 ProcessName = - IpAddress = 1.1.1.1 IpPort = 11111'

Using the re module and re.findall, I think you can get what you want:

    import re 
    li = re.findall(r'LogonType\s*=\s*\d+|TargetUserName\s*=\s*\w+|TargetUserSid\s*=\s*\w-.*?\s', st, re.MULTILINE | re.DOTALL)
    >>> li
    ['TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 ', 'TargetUserName = johndoe', 'LogonType = 3']
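
A possible refinement of the same alternation idea, offered as a sketch rather than as this answer's method: capture the field name and the value as separate groups, so a single pass yields a dictionary of only the wanted fields. The names `pattern` and `wanted` and the shortened sample string are assumptions for illustration; like the `\S+` approach, it assumes values contain no spaces.

```python
import re

# Shortened sample (assumption: same "Name = value" layout as the full line).
st = ('TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 '
      'TargetUserName = johndoe TargetDomainName = TestDomain LogonType = 3')

# Compile the pattern once, up front; the alternation limits matches
# to the three fields of interest, and \b avoids matching inside a longer name.
pattern = re.compile(r'\b(LogonType|TargetUserName|TargetUserSid)\s*=\s*(\S+)')

# findall scans the string once and returns (name, value) tuples.
wanted = dict(pattern.findall(st))
print(wanted['TargetUserName'])
print(wanted['LogonType'])
```

When looping over many lines, the compiled `pattern` object can be reused for each line, and fields that are absent from a line simply do not appear in the resulting dictionary.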