2014-07-21 47 views

Combining 2 awk loops into one in bash

I have this script that I built; it works fine, but I'd like to combine the two awks into a single one so that I get all the information on one line. Is that possible?

for i in $(cat domains); do 
    IFS='=' read -r -a array <<< "$i" 
    CC="${array[0]}" 

    awk -v c="$CC" '{a[substr($4,2,17)]++} END{for(i in a) print i, a[i], c}' "${array[1]}".access_log | sort 
    awk -v c="$CC" '{if ($0 ~ /html/) b[substr($4,2,17)]++} END{for(j in b) print j, b[j], c}' "${array[1]}".access_log | sort 

    exit    # debugging leftover: stops after the first domain
done 

A snippet of 'domains':

af=www.google.com.af 
al=www.google.al 
ao=www.google.co.ao 
ar=www.google.com.ar 
au=www.google.com.au 

For example: given af=www.google.com.af, it runs against www.google.com.af.access_log:

- - - [21/Jul/2014:14:35:18 +0200] "GET /apple-touch-icon.png HTTP/1.1" 404 246 "-" "MobileSafari/9537.53 CFNetwork/672.1.15 Darwin/14.0.0" 556 
- - - [21/Jul/2014:14:35:18 +0200] "GET /apple-touch-icon.png HTTP/1.1" 404 246 "-" "MobileSafari/9537.53 CFNetwork/672.1.15 Darwin/14.0.0" 556 
- - - [21/Jul/2014:14:36:18 +0200] "GET /apple-touch-icon.png HTTP/1.1" 404 246 "-" "MobileSafari/9537.53 CFNetwork/672.1.15 Darwin/14.0.0" 556 
- - - [21/Jul/2014:14:36:18 +0200] "GET /main.html HTTP/1.1" 404 246 "-" "MobileSafari/9537.53 CFNetwork/672.1.15 Darwin/14.0.0" 556 
- - - [21/Jul/2014:14:36:18 +0200] "GET /main.html HTTP/1.1" 404 246 "-" "MobileSafari/9537.53 CFNetwork/672.1.15 Darwin/14.0.0" 556 
- - - [21/Jul/2014:14:37:18 +0200] "GET /main.html HTTP/1.1" 404 246 "-" "MobileSafari/9537.53 CFNetwork/672.1.15 Darwin/14.0.0" 556 
- - - [21/Jul/2014:14:37:18 +0200] "GET /main.html HTTP/1.1" 404 246 "-" "MobileSafari/9537.53 CFNetwork/672.1.15 Darwin/14.0.0" 556 
- - - [21/Jul/2014:14:37:18 +0200] "GET /main.html HTTP/1.1" 404 246 "-" "MobileSafari/9537.53 CFNetwork/672.1.15 Darwin/14.0.0" 556 

should return:

21/Jul/2014:14:35 total: 2 html: 0 
21/Jul/2014:14:36 total: 3 html: 2 
21/Jul/2014:14:37 total: 3 html: 3 
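For reference, the `substr($4,2,17)` used below extracts a minute-resolution timestamp from the log's fourth field; a quick sketch against one sample line (shortened here for readability):

```shell
# $4 in the log line is "[21/Jul/2014:14:35:18"; substr(..., 2, 17) skips the
# leading "[" and keeps 17 characters, dropping the seconds (":18") as well.
echo '- - - [21/Jul/2014:14:35:18 +0200] "GET /x HTTP/1.1" 404 246' |
awk '{ print substr($4,2,17) }'
# prints: 21/Jul/2014:14:35
```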

Please show a sample of 'domains'. – konsolebox


Editing your question to show us an example of 'domains' would be useful. –


If we had some sample data from 'domains' and the result you'd like, we could do all of the 'cat', 'for', 'IFS' and so on in 'awk' itself. – Jotne

Answers


You can simplify your bash and combine your two awks with:

while IFS='=' read -r cc domain; do 
    awk -v c="$cc" 'BEGIN { OFS = ":" } { d = substr($4,2,17); ++a[d] } /html/ { ++b[d] } END { for (i in a) print i, a[i], b[i] ? b[i] : "0", c }' "$domain".access_log | sort 
done < domains 
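As a quick sanity check of the 'IFS' split used by the loop (hypothetical sample line, not your data):

```shell
# IFS='=' makes read split each "cc=domain" line into two variables
# at the "=" sign; -r prevents backslash interpretation.
printf 'af=www.google.com.af\n' | while IFS='=' read -r cc domain; do
    echo "cc=$cc domain=$domain"
done
# prints: cc=af domain=www.google.com.af
```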

To keep the order without using sort:

while IFS='=' read -r cc domain; do 
    awk -v cc="$cc" 'BEGIN { OFS = ":" } { i = substr($4,2,17) } !a[i]++ { d[++j] = i } /html/ { ++b[i] } END { for (j = 1; j in d; ++j) { i = d[j]; print i, a[i], b[i] ? b[i] : "0", cc } }' "$domain".access_log 
done < domains 

It's perfect! :) Any idea why awk doesn't keep the default order of the lines? –


@punked It depends on the implementation. Sometimes, for faster processing, the associative arrays are not kept ordered. You can use another, numerically indexed array to store the keys in order. – konsolebox
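The indexed-array trick mentioned above can be sketched in isolation like this (made-up input, portable awk):

```shell
# Record each distinct key in a second, numerically indexed array the first
# time it is seen (!seen[$0]++ is true only then), then iterate that array
# to emit the keys in input order rather than hash order.
printf 'c\na\nc\nb\n' |
awk '!seen[$0]++ { order[++n] = $0 }
     END { for (i = 1; i <= n; i++) print order[i] }'
# prints: c, a, b (one per line, in order of first appearance)
```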


It doesn't depend on the implementation. awk arrays are stored as hash tables, and the 'in' operator visits them in whatever order they happen to be stored. You can modify that behavior in GNU awk with 'PROCINFO["sorted_in"]' (see the man page), but by default that only defines a sorted order; there is no option to visit elements in the order they were read unless you supply a function that provides that order. –


It looks like you need something like this (using GNU awk for 'ENDFILE' and 'delete array'):

awk ' 
NR==FNR { ARGV[ARGC++] = $2 ".access_log"; next } 
{ 
    time = substr($4,2,17) 
    totCount[time]++ 
    if (/html/) 
     htmlCount[time]++ 
} 
ENDFILE { 
    for (time in totCount) { 
     print time, "total:", totCount[time], "html:", htmlCount[time]+0, FILENAME 
    } 
    delete totCount 
    delete htmlCount 
} 
' FS="=" domains FS=" " 
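The 'NR==FNR' block together with the two 'FS=' assignments can be surprising: the first pass reads 'domains' split on '=', appends each log's file name to 'ARGV' so awk processes it afterwards, and the trailing 'FS=" "' restores whitespace splitting for those files. A minimal, self-contained sketch of the same trick (file names and contents are made up):

```shell
# First pass (NR==FNR): read the mapping file with FS="=" and queue each
# "<name>.txt" as an extra input by appending it to ARGV.
# Later passes: FS=" " is in effect and each queued file is processed.
dir=$(mktemp -d) && cd "$dir"
printf 'af=alpha\nal=beta\n' > domains
printf 'hello\n' > alpha.txt
printf 'world\n' > beta.txt
awk 'NR==FNR { ARGV[ARGC++] = $2 ".txt"; next }
     { print FILENAME ": " $0 }' FS="=" domains FS=" "
# prints:
# alpha.txt: hello
# beta.txt: world
```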

No surrounding shell loop is necessary. If you want the timestamps in your output in the same order as they appear in your input, just tweak it to keep track of that order:

awk ' 
NR==FNR { ARGV[ARGC++] = $2 ".access_log"; next } 
{ 
    time = substr($4,2,17) 
    totCount[time]++ 
    if (/html/) 
     htmlCount[time]++ 
    if (!seen[time]++) 
     times[++numTimes] = time 
} 
ENDFILE { 
    for (i=1; i <= numTimes; i++) { 
     time = times[i] 
     print time, "total:", totCount[time], "html:", htmlCount[time]+0, FILENAME 
    } 
    delete totCount 
    delete htmlCount 
    delete times 
    delete seen 
    numTimes = 0 
} 
' FS="=" domains FS=" " 
21/Jul/2014:14:35 total: 2 html: 0 www.google.com.af.access_log 
21/Jul/2014:14:36 total: 3 html: 2 www.google.com.af.access_log 
21/Jul/2014:14:37 total: 3 html: 3 www.google.com.af.access_log 
21/Jul/2014:14:35 total: 2 html: 0 www.google.al.access_log 
21/Jul/2014:14:36 total: 3 html: 2 www.google.al.access_log 
21/Jul/2014:14:37 total: 3 html: 3 www.google.al.access_log 

The above was run using the 'domains' file you posted with just the first 2 domains, and with both '.access_log' files identical to the sample you posted.