How to do IIS log analysis in a Pythonic way?

OK, so I have some IIS logs that I would like to parse with Python (which I am fairly new to at the moment). A sample of the IIS log looks like this:

#Software: Microsoft Internet Information Server 6.0 
#Version: 1.0 
#Date: 1998-11-19 22:48:39 
#Fields: date time c-ip cs-username s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs-version cs(User-Agent) cs(Cookie) cs(Referrer) 

1998-11-19 22:48:39 206.175.82.5 - 208.201.133.173 GET /global/images/navlineboards.gif - 200 540 324 157 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+4.01;+Windows+95) USERID=CustomerA;+IMPID=http://www.loganalyzer.net 
1998-11-20 22:55:39 206.175.82.8 - 208.201.133.173 GET /global/something.pdf - 200 540 324 157 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+4.01;+Windows+95) USERID=CustomerA;+IMPID=http://www.loganalyzer.net

There are only 2 lines of log data here, whereas my real logs have thousands of lines each. So this is just a short example.

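From this log I would like to extract data such as the client IP addresses that made the most connections, the files that were downloaded most often, the URIs that were visited most often, and so on. Basically, what I want is to get some statistics. For example, as a result I would like to see something like this: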

file download_count 
example1.pdf 9 
example2.pdf 6 
example3.doc 2

or

IP file hits 
192.168.1.5 /sample/example1.gif 8 
192.168.1.9 /files/example2.gif 8

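What I don't know is how to approach this in a Pythonic way. At first I thought I would split each line of the log, make a list out of it, and append each of those to a bigger list (I think of it as a 2D array). Then I got to the stage of extracting statistics from that big list, and now I think it might be better to build a dictionary out of all that data and count things by dict keys and dict values. Is that a better approach than using lists? If I would be better off with lists, how should I go about it? What should I google, and what should I look for?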

So I am looking for ideas on how this should usually be done. Thanks.

Source

2011-06-28 pootzko

Google "python IIS parser" and have a look at the top 2 hits (the third one is your question) –

Assuming skip_header(file) returns only the log lines from the file, and parse(line) extracts (ip, path) from a line:

from collections import defaultdict

first = defaultdict(int)                        # path -> hit count
second = defaultdict(lambda: defaultdict(int))  # ip -> path -> hit count
for line in skip_header(file):
    ip, path = parse(line)
    first[path] += 1
    second[ip][path] += 1

For the first:

print "path count" 
for path, count in first.iteritems(): 
    print "%s %d" % (path, count)

For the second:

print "ip path count" 
for ip,d in second.iteritems(): 
    for path, count in d.iteritems(): 
        print "%s %s %d" % (ip, path, count)
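
The two helpers assumed above are not shown in this answer. A minimal sketch of what they might look like (written for Python 3), assuming the log uses the space-separated columns listed in the #Fields line of the question's sample. The function bodies and the FIELDS constant are hypothetical illustrations, not the answerer's code:

```python
# Column names copied from the #Fields directive in the question's sample.
# Assumption: every data line follows this layout.
FIELDS = ("date time c-ip cs-username s-ip cs-method cs-uri-stem "
          "cs-uri-query sc-status sc-bytes cs-bytes time-taken "
          "cs-version cs(User-Agent) cs(Cookie) cs(Referrer)").split()


def skip_header(lines):
    """Yield only data lines, dropping blank lines and '#' directives."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            yield line


def parse(line):
    """Return (client ip, requested path) from one data line."""
    # zip() stops at the shorter sequence, so a line that lacks the
    # trailing columns (e.g. no referrer) still maps correctly.
    record = dict(zip(FIELDS, line.split()))
    return record["c-ip"], record["cs-uri-stem"]
```

Since the set of logged fields can vary from one server configuration to another, reading the column names from the log's own #Fields line instead of hard-coding them would probably be more robust.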

Source

2011-06-28 08:50:10

Thanks Dan. By the way, I am using Python 3, so if anyone tries this, you need to use items() instead of iteritems(), and of course print(). – pootzko
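
For Python 3, the same counting could also be sketched with collections.Counter, whose most_common() method gives the "most hits first" ordering the question asks about. The pairs list below is made-up sample data standing in for the (ip, path) tuples obtained from parsing the log:

```python
from collections import Counter

# Made-up (ip, path) pairs standing in for parsed log lines.
pairs = [
    ("192.168.1.5", "/sample/example1.gif"),
    ("192.168.1.5", "/sample/example1.gif"),
    ("192.168.1.9", "/files/example2.gif"),
]

path_hits = Counter(path for _, path in pairs)  # hits per file
ip_path_hits = Counter(pairs)                   # hits per (ip, file) pair

print("path count")
for path, count in path_hits.most_common():
    print(path, count)

print("ip path count")
for (ip, path), count in ip_path_hits.most_common():
    print(ip, path, count)
```

Passing a number, as in most_common(10), limits the output to the top entries.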
