2017-10-07 33 views

How can I speed up parsing Apache logs with regex into a pandas DataFrame?

I am writing a script that parses a large (400 MB) Apache log file into a pandas table.

On my old laptop the script parses the Apache log file in about 2 minutes. Now I wonder whether it could be faster.

The Apache log file is structured like this: ip - - [timestamp] "GET ... method" HTTP status code bytes "address" "user agent". For example:

93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…" 200 575 "http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0" 

My code uses regex findall. I also tested the match and search methods, but they seem to be slower.

reg_dic = { 
    "ip" : r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b', 
    "timestamp" : r'\[\d+\/\w+\/\d+\:\d+\:\d+\:\d+\s\+\d+\]', 
    "method" : r'"(.*?)"', 
    "httpstatus" : r'\s\d{1,3}\s', 
    "bytes_" : r'\s\d+\s\"', 
    "adress" : r'\d\s\"(.*?)"', 
    "useragent" : r'\"\s\"(.*?)"' 
} 

for name, reg in reg_dic.items():
    item_list = []
    with open(file) as f_obj:
        for line in f_obj:
            item = re.findall(reg, line)
            item = item[0]
            if name == "bytes_":
                item = item.replace("\"", "")
            item = item.strip()
            item_list.append(item)
    df[name] = item_list
    del item_list

See [this Python demo](https://ideone.com/LLW3Uf) and the [regex demo](https://regex101.com/r/UOtsAL/1). If your log lines always have the same format, this should be fast and safe.
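The question's loop re-reads the file once per field, so a 400 MB log is scanned seven times. A minimal sketch of the single-pass alternative, assuming every line follows the format shown in the question: one compiled pattern with named groups is matched once per line. The pattern, the `parse_lines` helper, and the sample data are illustrative assumptions, not the code from the linked demo.

```python
import re
import pandas as pd

# Field names reuse the keys of reg_dic from the question.
COLUMNS = ["ip", "timestamp", "method", "httpstatus", "bytes_", "adress", "useragent"]

# One compiled pattern with a named group per field (an assumed pattern
# for the log format shown in the question).
LOG_PATTERN = re.compile(
    r'^(?P<ip>[\d.]+)(?:\s+\S+){2}\s+'
    r'\[(?P<timestamp>[^\]]+)\]\s+'
    r'"(?P<method>[^"]*)"\s+'
    r'(?P<httpstatus>\d+)\s+(?P<bytes_>\d+)\s+'
    r'"(?P<adress>[^"]*)"\s+"(?P<useragent>[^"]*)"\s*$'
)


def parse_lines(lines):
    """Parse an iterable of log lines in one pass; lines that do not match are skipped."""
    rows = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            rows.append(m.groupdict())
    return pd.DataFrame(rows, columns=COLUMNS)


sample_line = (
    '93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…" 200 575 '
    '"http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) '
    'Gecko/20100101 Firefox/54.0"\n'
)
parsed = parse_lines([sample_line, "not a log line\n", sample_line])
print(parsed[["ip", "httpstatus", "bytes_"]])
```

For a real file, passing the open file object works the same way, for example `with open(path) as f: df = parse_lines(f)`. Lines whose byte count is logged as `-` would be skipped by this pattern, so it may need loosening for real traffic.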

Answers


You can use `str.extract` with the `expand` parameter set to `True`, which returns a DataFrame built from the extracted data. Hope it helps.

Example DataFrame:

import pandas as pd

# One sample log line, kept on a single logical line so the string literal is valid
log_line = ('93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…" 200 575 '
            '"http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) '
            'Gecko/20100101 Firefox/54.0"')

# Three identical rows, as in the original example
df = pd.DataFrame({"log": [log_line] * 3})

This is based on the improved regex from @Wiktor Stribiżew:

ws = r'^(?P<ip>[\d.]+)(?:\s+\S+){2}\s+\[(?P<timestamp>[\w:/\s+]+)\]\s+"(?P<method>[^"]+)"\s+(?P<httpstatus>\d+)\s+(?P<bytes>\d+)\s+(?P<adress>"[^"]+")\s+(?P<useragent>"[^"]+")$' 

new = df['log'].str.extract(ws,expand=True) 

Output:

 
      ip     timestamp    method httpstatus \ 
0 93.185.11.11 13/Aug/2016:05:34:12 +0200 GET /v1/con?from=…  200 
1 93.185.11.11 13/Aug/2016:05:34:12 +0200 GET /v1/con?from=…  200 
2 93.185.11.11 13/Aug/2016:05:34:12 +0200 GET /v1/con?from=…  200 

    bytes    adress \ 
0 575 "http://google.com" 
1 575 "http://google.com" 
2 575 "http://google.com" 

              useragent 
0 "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) ... 
1 "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) ... 
2 "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) ... 
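To apply the same `str.extract` call to a log file rather than a hand-built DataFrame, the lines can first be loaded into a Series. The following is a sketch only: `io.StringIO` stands in for the real file, and the names `fake_file`, `lines`, and `parsed` are made up for illustration. Because extracted groups come back as strings, the numeric columns are converted explicitly.

```python
import io
import pandas as pd

# Same pattern as in the answer, written as a raw string.
ws = (r'^(?P<ip>[\d.]+)(?:\s+\S+){2}\s+\[(?P<timestamp>[\w:/\s+]+)\]\s+'
      r'"(?P<method>[^"]+)"\s+(?P<httpstatus>\d+)\s+(?P<bytes>\d+)\s+'
      r'(?P<adress>"[^"]+")\s+(?P<useragent>"[^"]+")$')

log_line = (
    '93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…" 200 575 '
    '"http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) '
    'Gecko/20100101 Firefox/54.0"'
)

# Stand-in for open(path); a real log file would be read the same way.
fake_file = io.StringIO("\n".join([log_line] * 3) + "\n")

# splitlines() drops the trailing newlines, so the pattern's final $ can match.
lines = pd.Series(fake_file.read().splitlines(), name="log")
parsed = lines.str.extract(ws, expand=True)

# Extracted values are strings; convert the numeric fields.
for col in ("httpstatus", "bytes"):
    parsed[col] = pd.to_numeric(parsed[col])

print(parsed.dtypes)
```

Rows that do not match the pattern come back as NaN in every column, so `parsed.dropna(how="all")` is one way to discard malformed lines afterwards.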

I don't think we need that many regexes for this simple task:

fn = r'D:\temp\.data\46620093.log' 
cols = ['ip','l','userid','timestamp','tz','request','status','bytes','referer','useragent'] 

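# sep=r'\s+' splits on whitespace while keeping double-quoted fields intact;
# drop(columns=...) replaces the positional axis argument that newer pandas versions reject
df = pd.read_csv(fn, sep=r'\s+', names=cols).drop(columns='l')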

This gives us:

In [179]: df 
Out[179]: 
      ip userid    timestamp  tz    request \ 
0 93.185.11.11  - [13/Aug/2016:05:34:12 +0200] GET /v1/con?from=… 
1 93.185.11.11  - [13/Aug/2016:05:34:12 +0200] GET /v1/con?from=… 
2 93.185.11.11  - [13/Aug/2016:05:34:12 +0200] GET /v1/con?from=… 

    status bytes   referer \ 
0  200 575 http://google.com 
1  200 575 http://google.com 
2  200 575 http://google.com 

              useragent 
0 Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G... 
1 Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G... 
2 Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G... 
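The reason this works without a regex is that `read_csv` treats double-quoted fields as single values even when they contain spaces, so the request, referer, and user agent each land in one column. A self-contained sketch, using `io.StringIO` in place of the log file (the in-memory data and the name `demo` are assumptions for illustration):

```python
import io
import pandas as pd

cols = ['ip', 'l', 'userid', 'timestamp', 'tz', 'request',
        'status', 'bytes', 'referer', 'useragent']

log_line = (
    '93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…" 200 575 '
    '"http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) '
    'Gecko/20100101 Firefox/54.0"\n'
)

# Two sample lines standing in for the real log file.
buffer = io.StringIO(log_line * 2)

# Whitespace-separated fields; quoted strings stay together as one field.
demo = pd.read_csv(buffer, sep=r'\s+', names=cols).drop(columns='l')

print(demo[['ip', 'request', 'status', 'bytes']])
```

The timestamp is still split across `timestamp` and `tz` at this point, because the square brackets are not quote characters.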

Now we only need to join `timestamp` and `tz` into one column and get rid of the `[` and `]`:

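# regex=True is required on newer pandas, where str.replace defaults to literal matching
df['timestamp'] = df['timestamp'].str.replace(r'\[(\d+/\w+/\d+):(\d+:\d+:\d+)', r'\1 \2', regex=True) \ 
        + ' ' + df.pop('tz').str.strip(r'[\]]') 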

Result:

In [181]: df 
Out[181]: 
      ip userid     timestamp    request \ 
0 93.185.11.11  - 13/Aug/2016 05:34:12 +0200 GET /v1/con?from=… 
1 93.185.11.11  - 13/Aug/2016 05:34:12 +0200 GET /v1/con?from=… 
2 93.185.11.11  - 13/Aug/2016 05:34:12 +0200 GET /v1/con?from=… 

    status bytes   referer \ 
0  200 575 http://google.com 
1  200 575 http://google.com 
2  200 575 http://google.com 

              useragent 
0 Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G... 
1 Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G... 
2 Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G... 

Note: we can easily convert it to the `datetime` dtype (UTC time, without a time zone):

In [182]: pd.to_datetime(df['timestamp']) 
Out[182]: 
0 2016-08-13 03:34:12 
1 2016-08-13 03:34:12 
2 2016-08-13 03:34:12 
Name: timestamp, dtype: datetime64[ns]
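How `pd.to_datetime` handles the `+0200` offset when it has to infer the format can differ between pandas versions, so it may be safer to spell the format out. A minimal sketch, assuming the cleaned-up `timestamp` strings shown above; the names `ts`, `aware`, and `naive_utc` are illustrative. Note that `%b` matches English month abbreviations only under an English locale.

```python
import pandas as pd

# Timestamps in the shape produced by the clean-up step above.
ts = pd.Series(["13/Aug/2016 05:34:12 +0200"] * 3, name="timestamp")

# An explicit format avoids format inference; utc=True converts the
# +0200 offset to UTC and yields a timezone-aware column.
aware = pd.to_datetime(ts, format="%d/%b/%Y %H:%M:%S %z", utc=True)

# Drop the timezone to get naive UTC values like those in the output above.
naive_utc = aware.dt.tz_localize(None)

print(aware.iloc[0])
print(naive_utc.iloc[0])
```

Passing an explicit `format` also tends to be faster on large columns than letting pandas guess it, which matters for a 400 MB log.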