2011-08-22 23 views
2

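How can I use awk to get all the statistics-related information from a file?

I want to parse some CSV files using awk.

The CSV file I'm working with looks like this: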

fnName,minAccessTime,maxAccessTime 
getInfo,300,600 
getStage,600,800 
getStage,600,800 
getInfo,250,620 
getInfo,200,700 
getStage,700,1000 
getInfo,280,600 

I need to find the minimum, maximum, and average of columns 2 and 3, both across all the data and for each function.

Answers

4

The following awk script should give you everything you need to get what you want.

It basically goes through every line in the input file, skipping the first one (the CSV header) with the NR > 1 condition.

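For every other record, it updates the record count, the minimum of the minima, the maximum of the minima, the minimum of the maxima, the maximum of the maxima, the sum of the minima, and the sum of the maxima, both for the data as a whole and for each individual function name.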

The former are stored in count, min_min, max_min, min_max, max_max, sum_min, and sum_max. The latter are stored in associative arrays with similar names (with _arr appended).

Then, once all records have been read, the END section prints the information.

NR > 1 {
    count++;
    sum_min += $2;
    sum_max += $3;
    if (count == 1) {
        min_min = $2;
        max_min = $2;
        min_max = $3;
        max_max = $3;
    } else {
        if ($2 < min_min) { min_min = $2; }
        if ($2 > max_min) { max_min = $2; }
        if ($3 < min_max) { min_max = $3; }
        if ($3 > max_max) { max_max = $3; }
    }

    count_arr[$1]++;
    sum_min_arr[$1] += $2;
    sum_max_arr[$1] += $3;
    if (count_arr[$1] == 1) {
        min_min_arr[$1] = $2;
        max_min_arr[$1] = $2;
        min_max_arr[$1] = $3;
        max_max_arr[$1] = $3;
    } else {
        if ($2 < min_min_arr[$1]) { min_min_arr[$1] = $2; }
        if ($2 > max_min_arr[$1]) { max_min_arr[$1] = $2; }
        if ($3 < min_max_arr[$1]) { min_max_arr[$1] = $3; }
        if ($3 > max_max_arr[$1]) { max_max_arr[$1] = $3; }
    }
}

END {
    print "Overall:"
    print " Total records = " count
    print " Sum of minima = " sum_min
    print " Sum of maxima = " sum_max
    if (count > 0) {
        print " Min of minima = " min_min
        print " Max of minima = " max_min
        print " Min of maxima = " min_max
        print " Max of maxima = " max_max
        print " Avg of minima = " sum_min/count
        print " Avg of maxima = " sum_max/count
    }
    for (task in count_arr) {
        print "Function " task ":"
        print " Total records = " count_arr[task]
        print " Sum of minima = " sum_min_arr[task]
        print " Sum of maxima = " sum_max_arr[task]
        print " Min of minima = " min_min_arr[task]
        print " Max of minima = " max_min_arr[task]
        print " Min of maxima = " min_max_arr[task]
        print " Max of maxima = " max_max_arr[task]
        print " Avg of minima = " sum_min_arr[task]/count_arr[task]
        print " Avg of maxima = " sum_max_arr[task]/count_arr[task]
    }
}

Save that script as qq.awk, put your sample data in qq.in, then run:

awk -F, -f qq.awk qq.in 

This produces the following output, which I'm fairly sure gives you every piece of information you could possibly need:

Overall: 
    Total records = 7 
    Sum of minima = 2930 
    Sum of maxima = 5120 
    Min of minima = 200 
    Max of minima = 700 
    Min of maxima = 600 
    Max of maxima = 1000 
    Avg of minima = 418.571 
    Avg of maxima = 731.429 
Function getStage: 
    Total records = 3 
    Sum of minima = 1900 
    Sum of maxima = 2600 
    Min of minima = 600 
    Max of minima = 700 
    Min of maxima = 800 
    Max of maxima = 1000 
    Avg of minima = 633.333 
    Avg of maxima = 866.667 
Function getInfo: 
    Total records = 4 
    Sum of minima = 1030 
    Sum of maxima = 2520 
    Min of minima = 200 
    Max of minima = 300 
    Min of maxima = 600 
    Max of maxima = 700 
    Avg of minima = 257.5 
    Avg of maxima = 630 
6

I know you're not looking for a non-awk solution, but I thought I'd share some R code that demonstrates one way of summarizing the data.

# read in data 
awk <- read.table(textConnection("fnName,minAccessTime,maxAccessTime 
getInfo,300,600 
getStage,600,800 
getStage,600,800 
getInfo,250,620 
getInfo,200,700 
getStage,700,1000 
getInfo,280,600"), header = TRUE, sep = ",") 

# split according to the function 
awk.split <- split(awk, awk$fnName) 

# for each function, calculate full summary for columns 2 and 3 
lapply(X = awk.split, FUN = function(x) {
    summary(x[2:3])
})

Result:

$getInfo 
minAccessTime maxAccessTime 
Min. :200.0 Min. :600 
1st Qu.:237.5 1st Qu.:600 
Median :265.0 Median :610 
Mean :257.5 Mean :630 
3rd Qu.:285.0 3rd Qu.:640 
Max. :300.0 Max. :700 

$getStage 
minAccessTime maxAccessTime 
Min. :600.0 Min. : 800.0 
1st Qu.:600.0 1st Qu.: 800.0 
Median :600.0 Median : 800.0 
Mean :633.3 Mean : 866.7 
3rd Qu.:650.0 3rd Qu.: 900.0 
Max. :700.0 Max. :1000.0 
0

If you insist on awk ...

$ awk -F, ' 
> function newmin(fname, array, value) { if (!(fname in array) || array[fname]>value) array[fname] = value }
> function newmax(fname, array, value) { if (!(fname in array) || array[fname]<value) array[fname] = value }
> NR>1 { 
> newmin($1,min2,$2) 
> newmin("global",min2,$2) 
> newmax($1,max2,$2) 
> newmax("global",max2,$2) 
> newmin($1,min3,$3) 
> newmin("global",min3,$3) 
> newmax($1,max3,$3) 
> newmax("global",max3,$3) 
> } 
> END { for (fname in min2) { print fname, min2[fname], max2[fname], min3[fname], max3[fname] } }' 
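
The one-liner above reports only minima and maxima, while the question also asks for averages. Here is a minimal sketch of how the same "global" key idea could be extended with a count and a sum per key. The file name qq_sample.csv, the stats wrapper, the update function, and the cnt/sum/min/max array names are all made up for illustration and are not part of the answer above.

```shell
#!/bin/sh
# Sample data from the question, written to a scratch file (hypothetical name).
cat > qq_sample.csv <<'EOF'
fnName,minAccessTime,maxAccessTime
getInfo,300,600
getStage,600,800
getStage,600,800
getInfo,250,620
getInfo,200,700
getStage,700,1000
getInfo,280,600
EOF

# stats FILE: print min, max and average of columns 2 and 3,
# per function name and under the extra key "global".
stats() {
    awk -F, '
    # Fold one value into the running count, sum, min and max for (key, col).
    function update(key, col, value,    k) {
        k = key SUBSEP col
        if (!(k in cnt) || value + 0 < min[k]) min[k] = value + 0
        if (!(k in cnt) || value + 0 > max[k]) max[k] = value + 0
        cnt[k]++
        sum[k] += value
    }
    NR > 1 {                      # skip the CSV header line
        for (c = 2; c <= 3; c++) {
            update($1, c, $c)
            update("global", c, $c)
        }
    }
    END {
        for (k in cnt) {
            split(k, part, SUBSEP)
            printf "%s col%d min=%d max=%d avg=%.2f\n",
                   part[1], part[2], min[k], max[k], sum[k] / cnt[k]
        }
    }' "$1" | LC_ALL=C sort       # for-in order is unspecified, so sort the lines
}

stats qq_sample.csv
```

With the sample data this prints six lines, one per key and column, for example "getInfo col2 min=200 max=300 avg=257.50". The averages agree with the ones in the first answer, just rounded to two decimals by printf.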