2012-10-27 83 views
1

How to parse web server logs using R? What is the best way to parse a log file like this one with R?

- - - [20/Nov/2011:01:16:29 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 279 
- - - [20/Nov/2011:01:16:29 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.1" 200 11769 
- - - [20/Nov/2011:01:16:29 +0100] "GET /IDEE-ServicesSearch/ServicesSearch.html?locale=es HTTP/1.1" 200 1665 
- - - [20/Nov/2011:01:16:29 +0100] "GET /search/indexLayout.jsp?PAGELANGUAGE=es HTTP/1.1" 200 9874 
- - - [20/Nov/2011:01:16:29 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.1" 200 12058 
- - - [20/Nov/2011:01:16:30 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 258038 
- - - [20/Nov/2011:01:17:09 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769 
- - - [20/Nov/2011:01:17:33 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769 
- - - [20/Nov/2011:01:17:33 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647 
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "POST /csw/?locale=es HTTP/1.0" 200 2536 
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.0" 200 11769 
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.0" 200 12058 
- - - [20/Nov/2011:01:17:39 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867 
- - - [20/Nov/2011:01:17:46 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867 
- - - [20/Nov/2011:01:18:10 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647 
- - - [20/Nov/2011:01:19:01 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769 

I have to take edge cases into account, such as a line that contains two IPs (internal and external).

Thanks!

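A regular expression could probably be used to parse this, similar to what you would do in Perl. My question is: what do you ultimately want the data to look like? –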

Get ready to write bitchy regular expressions. Tagged expressions and `gsub` are your friends. – aL3xa
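As an illustration of the tagged-expression idea mentioned in this comment, here is a minimal sketch that splits a request string into its parts with `sub()` and backreferences. The variable names (`request`, `pat`, `method`, `path`, `protocol`) are illustrative, and the pattern assumes the request always has the shape "METHOD path HTTP/x.y" seen in the sample log.

```r
# Two request strings taken from the sample log in the question
request <- c("POST /csw/servlet/cswservlet HTTP/1.1",
             "GET /search/indexLayout.jsp?PAGELANGUAGE=es HTTP/1.1")

# Three tagged groups: method, path (anything without a space), protocol
pat <- "^([A-Z]+) ([^ ]+) (HTTP/[0-9.]+)$"

# Each call keeps only one tagged group via its backreference
method   <- sub(pat, "\\1", request)
path     <- sub(pat, "\\2", request)
protocol <- sub(pat, "\\3", request)

data.frame(method, path, protocol, stringsAsFactors = FALSE)
```

Note that `sub()` returns a string unchanged when the pattern does not match it, so malformed requests would need a separate check (for example with `grepl(pat, request)`).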

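If you want to make your life easier, Apache has a very flexible way of specifying what the log file looks like. This "common log" format is a pain, because half of the fields are space-separated, another half are delimited by square brackets, another half are wrapped in quotes, and another half are comma-separated... it just doesn't add up. See http://httpd.apache.org/docs/1.3/logs.html for how to reconfigure the logs and make them sane (assuming you have access to the web server). – Spacedman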

Answers

3

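For this example, it is enough to replace the leading dash with two NAs separated by a space, and the comma between two IPs with a space. You can then parse the result with read.table():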

datlog <- readLines(textConnection('- - - [20/Nov/2011:01:16:29 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 279 
- - - [20/Nov/2011:01:16:29 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.1" 200 11769 
- - - [20/Nov/2011:01:16:29 +0100] "GET /IDEE-ServicesSearch/ServicesSearch.html?locale=es HTTP/1.1" 200 1665 
- - - [20/Nov/2011:01:16:29 +0100] "GET /search/indexLayout.jsp?PAGELANGUAGE=es HTTP/1.1" 200 9874 
- - - [20/Nov/2011:01:16:29 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.1" 200 12058 
- - - [20/Nov/2011:01:16:30 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 258038 
- - - [20/Nov/2011:01:17:09 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769 
- - - [20/Nov/2011:01:17:33 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769 
- - - [20/Nov/2011:01:17:33 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647 
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "POST /csw/?locale=es HTTP/1.0" 200 2536 
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.0" 200 11769 
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.0" 200 12058 
- - - [20/Nov/2011:01:17:39 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867 
- - - [20/Nov/2011:01:17:46 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867 
- - - [20/Nov/2011:01:18:10 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647 
- - - [20/Nov/2011:01:19:01 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769')) 
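# Lines without an IP start with a single "-": turn it into two NA fields
datlog <- sub("^-", "NA NA", datlog)
# Lines with two IPs separate them with a comma: make them two whitespace-separated fields
datlog <- sub(",", " ", datlog, fixed = TRUE)
# read.table() keeps the quoted request together as a single field
datlog <- read.table(text = datlog, fill = TRUE, stringsAsFactors = FALSE)
datlog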

Spacedman asked about datetime parsing:

datlog[['dtime']] <- as.POSIXct(paste(sub("\\[", "", datlog[[5]]), 
             sub("\\]", "", datlog[[6]])), 
           format="%d/%b/%Y:%H:%M:%S %z") 
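
One caveat: `%b` is matched against the month abbreviations of the current locale, so the call above may return NA on a non-English system. Below is a minimal sketch of a locale-independent variant, assuming every timestamp has exactly the shape shown in the sample log. The names `stamp`, `pat`, `mon` and `iso` are illustrative; `month.abb` is the built-in vector of English month abbreviations.

```r
# Timestamps as they appear in the log, without the square brackets
stamp <- c("20/Nov/2011:01:16:29 +0100", "20/Nov/2011:01:19:01 +0100")

# Tagged groups: day, month name, year, time of day, UTC offset
pat <- "^([0-9]{2})/([A-Za-z]{3})/([0-9]{4}):([0-9:]{8}) ([+-][0-9]{4})$"

# Look the month name up in month.abb instead of relying on %b
mon <- match(sub(pat, "\\2", stamp), month.abb)

# Rebuild the timestamp with a numeric month
iso <- sprintf("%s-%02d-%s %s %s",
               sub(pat, "\\3", stamp), mon, sub(pat, "\\1", stamp),
               sub(pat, "\\4", stamp), sub(pat, "\\5", stamp))

# %z applies the +0100 offset; the result is stored as a UTC instant
dtime <- as.POSIXct(iso, format = "%Y-%m-%d %H:%M:%S %z", tz = "UTC")
format(dtime, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

The design choice here is to convert the month name by lookup, so only numeric conversion specifications are left in the format string.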

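Would this fail if there is a comma in the query? I don't think they are required to be escaped. Obviously, there is still some work to do to parse the date. – Spacedman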

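If you mean the regular-expression pattern, then I see that the escape is not necessary, but unlike many other unnecessary escapes, it does not throw an error. The request for "parsing" is a bit vague. One could also imagine wanting to extract information from the HTML request. –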

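No, I mean a comma in the GET path. The log format is: a dash or one or more comma-and-space-separated IP addresses, repeated three times, the date in square brackets, the quoted request, the status code, and the size. Fiddly! – Spacedman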
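
The format described in this comment can also be matched field by field with one regular expression, so that commas inside the request are left alone. The following is a minimal sketch, not a complete log parser: `parse_log()` and its column names are hypothetical, the third sample line (with commas in the query) is made up for illustration, and `regexec(..., perl = TRUE)` assumes R 3.3.0 or later.

```r
parse_log <- function(lines) {
  pat <- paste0(
    "^(.*?) (\\S+) (\\S+) ",        # client field ("-" or IP list), identity, user
    "\\[([^]]+)\\] ",                # timestamp inside square brackets
    "\"(.*)\" ",                      # quoted request, may contain commas
    "([0-9]{3}) ([0-9]+|-)\\s*$"     # status code and size ("-" if absent)
  )
  m <- regmatches(lines, regexec(pat, lines, perl = TRUE))
  ok <- lengths(m) == 8L             # full match plus 7 tagged groups
  stopifnot(any(ok))
  fields <- do.call(rbind, m[ok])

  # Split the client field on commas and keep the first and last address
  ips <- strsplit(fields[, 2], ",[[:space:]]*")
  first_ip <- vapply(ips, function(x) x[1], character(1))
  last_ip  <- vapply(ips, function(x) x[length(x)], character(1))
  first_ip[first_ip == "-"] <- NA_character_
  last_ip[last_ip == "-"]   <- NA_character_

  data.frame(
    first_ip = first_ip,
    last_ip  = last_ip,
    time     = fields[, 5],
    request  = fields[, 6],
    status   = as.integer(fields[, 7]),
    size     = suppressWarnings(as.integer(fields[, 8])),  # "-" becomes NA
    stringsAsFactors = FALSE
  )
}

log_lines <- c(
  '- - - [20/Nov/2011:01:16:29 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 279 ',
  '192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "POST /csw/?locale=es HTTP/1.0" 200 2536 ',
  # Hypothetical line, not from the original log: commas inside the query string
  '- - - [20/Nov/2011:01:17:39 +0100] "GET /hypothetical?bbox=1,2,3,4 HTTP/1.1" 200 8867 '
)
parsed <- parse_log(log_lines)
parsed
```

Lines that do not match the pattern are silently dropped by the `ok` filter, so comparing `nrow(parsed)` with `length(log_lines)` is a quick way to spot unparsed lines.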