2013-05-26

Parsing a large CSV log file

I have a large comma-separated log file and I want to extract some information from it.

2010-02-10 10:00:00.000 171.606 bad_gateway 
2010-02-10 10:00:00.234 400.680 bad_gateway 
2010-02-10 10:00:00.410 212.308 login_from 
2010-02-10 10:00:00.601 222.251 bad_gateway 

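The problem is that I need to write out the events within a time range (for example 10:00:00.000-11:00:00.000) and, for each minute, count how many of them fall into each duration bucket. I am trying to produce an output file like this: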

bad_gateway 
10:00 
     AVG  <1ms  1-10ms 10-100ms 100-500ms  500+ms 
     264.845          3 

login_from 
10:00 
     AVG  <1ms  1-10ms 10-100ms 100-500ms  500+ms 
     212.308          1 

bad_gateway         
10:01 
     AVG  <1ms  1-10ms 10-100ms 100-500ms  500+ms 
     xxx.xxx          x 

I tried using awk but got stuck figuring it out. Thanks for your help!

Here is what I have so far:

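BEGIN {
    # Lower (inclusive) and upper (exclusive) bound of each duration bucket, in ms.
    low["<1ms"] = 0;        high["<1ms"] = 1
    low["1-10ms"] = 1;      high["1-10ms"] = 10
    low["10-100ms"] = 10;   high["10-100ms"] = 100
    low["100-500ms"] = 100; high["100-500ms"] = 500
    low["500+ms"] = 500;    high["500+ms"] = 1000000000
}
{
    duration = $3    # the third field holds the duration; it was never assigned before
    total += duration
    count++
    for (i in high) {
        if (duration >= low[i] && duration < high[i]) {
            bin[i]++
        }
    }
}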

And in the END block I print the results with printf.

+0

Show some sample code. – bksi

+1

'I have a large comma-separated log file...' Where are the commas? Or have I forgotten what a comma looks like? – Kent

+0

My mistake, I made a cleaned-up file from the CSV. – Telcom

Answers

3

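Your input data is too short to test this thoroughly. Here is an awk script that does more or less what you are looking for. It is fully commented, so you can modify it from here to fit your needs:

Contents of script.awk: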

BEGIN { 
    header = sprintf("\t%-10s\t%10s\t%10s\t%10s\t%10s\t%10s", "AVG", "<1ms", "1-10ms", "10-100ms", "100-500ms", "500+ms") 

    ## Duration slices (upper bounds in ms) used for the output columns.
    slices = "1 10 100 500" 
    split(slices, slices_a) 

    ## Hardcoded start and end times. 
    start_time = mktime("2010 02 10 10 00 00") 
    end_time = mktime("2010 02 10 11 00 00") 
} 

{ 
    ## Extract hour, minute and second from time. 
    fields = split($2, time, /[:.]/) 
    if (fields != 4) { print "WARNING: Skipped line " FNR " because it had a badly formatted time."; next } 

    ## Save the previous time so we can tell when a minute has passed. The first line is 
    ## a special case because there is no saved value yet. 
    if (FNR == 1) { 
     prev_time = mktime("2010 02 10 " time[1] " " time[2] " " time[3]) 
    } 
    else { 
     curr_time = mktime("2010 02 10 " time[1] " " time[2] " " time[3]) 

     ## When a minute has passed, print all extracted data. 
     if (curr_time - prev_time > 59) { 

      print_minute_info(duration, prev_time, header, slices_a) 

      ## Initialize data. 
      prev_time = curr_time 
      delete duration 
     } 
    } 

    ## For each name (last field) concatenate durations. 
    duration[ $NF ] = duration[ $NF] "|" $3 
} 

END { 
    print_minute_info(duration, prev_time, header, slices_a) 
} 

## Traverse hash with following format (example): 
## duration[ bad_gateway ] = "|34.567|234.918|56.213|" 
## 
## So, for each key split with pipe, sum its values and try to 
## print a formatted output. 
function print_minute_info(duration,prev_time,header,slices_a,  name,sum,times,times_a,num_times,i,times_avg,printed) { 
    for (name in duration) { 
     sum = 0 
     printed = 0   ## Reset for every name; otherwise it stays set after the first one.
     times = substr(duration[name], 2) 
     split(times, times_a, /\|/) 
     num_times = length(times_a) 
     for (i = 1; i <= num_times; i++) { 
      sum = sum + times_a[i] 
     } 
     times_avg = sum/num_times 

     printf "%s\n", name 
     printf "%s\n", strftime("%H:%M", prev_time) 
     printf "%s\n", header 
     printf "\t%-10s", times_avg 

     ## This part tries to print the number of occurrences just 
     ## below its header. It can be improved. 
     for (i = 1; i <= length(slices_a); i++) { 
      if (times_avg < slices_a[i]) { 
       printf "%10d\n", num_times 
       printed = 1 
       break 
      } 
      else { 
       printf "\t%10s", "" 
      } 
     } 
     if (! printed) { 
      printf "%10d\n", num_times 
     } 
     printf "\n" 
    } 
} 

Assuming the following infile:

2010-02-10 10:00:00.000 171.606 bad_gateway 
2010-02-10 10:00:00.234 400.680 bad_gateway 
2010-02-10 10:00:00.410 212.308 login_from 
2010-02-10 10:00:00.601 223.251 bad_gateway 
2010-02-10 10:01:00.401 224.251 bad_gateway 
2010-02-10 10:01:00.701 225.251 bad_gateway 
2010-02-10 10:01:04.401 226.251 login_to 
2010-02-10 10:02:04.401 1.251 login_to 

Run it like this:

awk -f script.awk infile 

This yields:

login_from 
10:00 
    AVG    <1ms  1-10ms 10-100ms 100-500ms  500+ms 
    212.308            1 

bad_gateway 
10:00 
    AVG    <1ms  1-10ms 10-100ms 100-500ms  500+ms 
    265.179            3 

bad_gateway 
10:01 
    AVG    <1ms  1-10ms 10-100ms 100-500ms  500+ms 
    224.751            2 

login_to 
10:01 
    AVG    <1ms  1-10ms 10-100ms 100-500ms  500+ms 
    226.251            1 

login_to 
10:02 
    AVG    <1ms  1-10ms 10-100ms 100-500ms  500+ms 
    1.251       1 
+0

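I get errors when running the script. At line 56: illegal reference to local array times_a, and at line 70: illegal reference to local variable slices_a. Do you have any idea? – Telcom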

+0

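@Telcom: I don't get any errors. I run it as 'awk -f script.awk infile' (added to the answer, since I had forgotten it before). Also take into account the deliberate spaces in the function declaration; they are how 'awk' scripts separate parameters from local variables. – Birei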

+0

I run it the same way, but it keeps giving me those two errors. I have been trying to debug what is wrong, but without success so far. – Telcom
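
A note on the two errors quoted above: both reported lines call length() on an array (times_a and slices_a). Some awk implementations reject length(array), so this is a likely cause, although it is only an assumption because the thread never says which awk was used. The script also relies on mktime() and strftime(), which are not part of POSIX awk. The extra spaces in the function declaration are only a readability convention; awk itself does not require them. A minimal sketch of a portable alternative, using the count returned by split() instead of length(); the helper name avg_of is hypothetical:

```shell
# avg_of() is a hypothetical helper, not part of the answer's script.
# It averages a pipe-joined list such as "|171.606|400.680|223.251".
out=$(printf '%s\n' '|171.606|400.680|223.251' | awk '
function avg_of(joined,    parts, n, i, sum) {
    # split() returns the number of elements, so length(array) is not needed.
    n = split(substr(joined, 2), parts, "|")
    sum = 0
    for (i = 1; i <= n; i++) sum += parts[i]
    return (n > 0) ? sum / n : 0
}
{ printf "%.3f\n", avg_of($0) }
')
echo "$out"
```

With the three bad_gateway durations from the 10:00 minute of the infile this prints 265.179, the same average the answer reports. In script.awk the equivalent change would be to assign num_times from the split() call directly.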

1

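I'm not skilled enough with awk, but it is quite simple to do in Perl... Sorting data into buckets usually calls for a hash or array data structure. Just extract the fields with regular expressions, then use a hash to create the counter buckets and increment the counter on every occurrence, like this: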

while (<>) {    # iterate over the input file
    # extract fields here... e.g.
    # ($errType) = /(\S+)\s*$/;
    # etc.
    $bins->{$errType}{$time}{$duration}++;
}

# now iterate over the hashes and print out your report
foreach $key1 (keys %$bins) {
    foreach ...
}

Not quite the answer you were looking for, but maybe it can put you on the right track.
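
The same hash-of-counters idea can be sketched in awk, which has associative arrays too. This is only an illustration under stated assumptions: the input has the four whitespace-separated fields shown in the question, each bucket includes its lower bound and excludes its upper bound, and the names bucket, hist, order and nkeys are made up for this example. It counts every event in its own bucket and prints one line per event name and minute.

```shell
# Sketch: per-name, per-minute average plus a count for each duration bucket.
out=$(printf '%s\n' \
    '2010-02-10 10:00:00.000 171.606 bad_gateway' \
    '2010-02-10 10:00:00.234 400.680 bad_gateway' \
    '2010-02-10 10:00:00.410 212.308 login_from' \
    '2010-02-10 10:00:00.601 222.251 bad_gateway' | awk '
# Map a duration in ms to a bucket number 1..5
# (<1ms, 1-10ms, 10-100ms, 100-500ms, 500+ms).
function bucket(ms) {
    if (ms < 1)   return 1
    if (ms < 10)  return 2
    if (ms < 100) return 3
    if (ms < 500) return 4
    return 5
}
{
    # Key on event name plus HH:MM taken from the time field.
    key = $NF SUBSEP substr($2, 1, 5)
    if (!(key in n)) order[++nkeys] = key   # remember first-seen order
    n[key]++
    sum[key] += $3
    hist[key, bucket($3)]++
}
END {
    for (k = 1; k <= nkeys; k++) {
        key = order[k]
        split(key, part, SUBSEP)
        printf "%s %s avg=%.3f", part[1], part[2], sum[key] / n[key]
        for (b = 1; b <= 5; b++) printf " %d", hist[key, b]
        printf "\n"
    }
}')
echo "$out"
```

For the four sample lines from the question this prints "bad_gateway 10:00 avg=264.846 0 0 0 3 0" followed by "login_from 10:00 avg=212.308 0 0 0 1 0"; the five trailing numbers are the bucket counts in header order. Restricting the report to a time range such as 10:00-11:00 could be done by comparing the HH:MM string before counting.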

+0

Thank you, but I'm not proficient in Perl. – Telcom