2012-08-25 66 views
1

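Using awk/grep to pick lines from each group of multiple lines

I have a log file made up of groups of lines, with each group separated from the next by a blank line. I want to pick specific lines (those containing common patterns) out of each group. Each group describes one mail. A sample of the log file follows: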

#START# 
03:48:19:798: : <23/08/2012 03:48:19:019> 
03:48:19:798: : <---23/08/2012 03:48 ---> 
03:48:19:799: : MAIL FROM IP=1.2.3.4 
03:48:19:799: : START CHECKING OF IPLIMIT 
03:48:19:799: : STOP CHECKING OF IPLIMIT 
03:48:20:848:In : MAIL FROM: <[email protected]> 
03:48:20:848: : [A:A:A] 
03:48:20:849: : max attach size-->5242880 
03:48:20:856: : User Is Authenticated with "[email protected] and domain abc.com" 
03:48:20:856: : Passed 
03:48:20:987:In : RCPT TO: <[email protected]> 
03:48:20:987: : email [email protected] 
03:48:20:992: : [A:A:A] 
03:48:20:999: : passed 
03:48:20:999:Inside the Store Mails 
03:48:20:999: : BCC feature is not applicable [email protected] 
03:48:21:000: : BCC feature is not applicable from [email protected] 
03:48:21:000:Inside the Store 
03:48:21:132:In : RCPT TO: <[email protected]> 
03:48:21:132: : email [email protected] 
03:48:21:133: : [A:A:A] 
03:48:21:140: : passed 
03:48:21:140:Inside the Store Mails 
03:48:21:140: : BCC feature is not applicable [email protected] 
03:48:21:140: : not authenticated 
03:48:21:140:Inside the Store 
03:48:21:271: : Data Received 
03:50:32:049: : 552 Size Limit Exceeded(5242880) 
03:50:32:049: : File Moved in LargeSize Folder.... 
03:50:32:049: : File Moved in LargeSize Folder.... 
03:50:32:049: : Connection closed 
03:50:32:049: : File Deleted /home/Mail//mailbox/LargeSize/[email protected]:24085.444724474357(1345673901000) 
03:50:32:051: : File Deleted /home/Mail//mailbox/LargeSize/[email protected]:39872.512978520455(1345673901140) 
MAIL DATA : : 6815779 Bytes 
Total: Conn : 16713 Quit By Host : 5565 Stored : 11134 Loop:0 
#END# 
W A R N I N G ---------------W A R N I N G 

...Waiting for activity on port Total Thread Started & 16732 Stoped 16730 
#START# 
03:56:20:790: : <23/08/2012 03:56:20:020> 
03:56:20:790: : <---23/08/2012 03:56 ---> 
03:56:20:791: : MAIL FROM IP=2.3.4.5 
03:56:20:792: : IP IS FRIEND IN WHITELIST 
03:56:20:834:In : MAIL FROM:<[email protected]> 
03:56:20:834: : [A:A:A] 
03:56:20:834: : null 
03:56:20:834: : Passed 
03:56:20:834:In : RCPT TO: <[email protected]> 
03:56:20:834: : email [email protected] 
03:56:20:835: : Mailing List 
03:56:20:835: : [A:A:A] 
03:56:20:836: : passed 
03:56:20:836: : Proceesing maillist 
03:56:20:839: : Data Received 
03:56:20:865: : /home/Mail//mailbox/MailingList/[email protected]:79602.39544573233(1345674380836) Msg Queued For Delivery 
03:56:20:865: : Msg forward successfully 
03:56:20:865: : /home/Mail//mailbox/MailingList/M14310.39892966699(1345674380837) Msg Queued For Delivery 
MAIL DATA : : 27985 Bytes 
Total: Conn : 16732 Quit By Host : 5582 Stored : 11135 Loop:0 
#END# 

...Waiting for activity on port Total Thread Started & 16735 Stoped 16731 
#START# 
03:56:23:957: : <23/08/2012 03:56:23:023> 
03:56:23:957: : <---23/08/2012 03:56 ---> 
03:56:23:958: : MAIL FROM IP=2.3.4.5 
03:56:23:959: : IP IS FRIEND IN WHITELIST 
03:56:23:999:In : MAIL FROM: <[email protected]> 
03:56:23:999: : [A:A:A] 
03:56:23:999: : null 
03:56:23:999: : Passed 
03:56:23:999:In : RCPT TO: <[email protected]> 
03:56:23:999: : email [email protected] 
03:56:24:000: : [A:A:A] 
03:56:24:007: : passed 
03:56:24:008:Inside the Store Mails 
03:56:24:009: : BCC feature is not applicable [email protected] 
03:56:24:009: : not authenticated 
03:56:24:009:Inside the Store 
03:56:24:009: : Data Received 
03:56:24:053: : /home/Mail//mailbox/External/[email protected]:50098.70335800691(1345674384009) Msg Queued For Delivery 
03:56:24:054: : Msg forward successfully 
MAIL DATA : : 28276 Bytes 
Total: Conn : 16735 Quit By Host : 5582 Stored : 11136 Loop:0 
#END#  

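Here, [email protected] is an external mail ID, and [email protected] and [email protected] are internal mail IDs. For each mail, a group of lines is generated, starting at #START# and ending at #END#.

I want to run some pattern matching against each group. I only want the groups in which the mail goes from an internal email ID to an external email ID (the second group).

I do not want the groups in which the mail goes from an external email ID to an internal email ID (the first group), or from an internal email ID to an internal email ID (the third group).

Once I have the groups for internal-to-external mail, I want to extract the lines containing the words FROM and TO.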

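I tried using awk's RS, ORS, FS and OFS variables to turn each group of lines, from #START# to #END#, into a single-line record, but could not. I was unable to replace the newlines with a delimiter such as | or ~. I also do not know how to run multiple pattern matches against each record.

I tried the /PATTERN/ form, but then could not run a grep command through the system() function to pull out the lines and check the domain name. It gave me the error: sh: 1: not found. I could not get past it. The code I used: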

if ($0 ~ /FROM/) { print $0 | system("egrep -i 'FROM|TO'") } 

Also, if I try to hand over each record using code like the following, it does not work:

for i in $(cat log_file | awk_file_givin_1_resource_record_at_a_time) ; do pattern_matching_commands ; done 

This does not work because the pattern matching is applied one line at a time, whereas I want it to work on the whole group.

+0

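This feels a bit too broad; there are several separate questions in here. I would try to break the problem into steps and tackle each step on its own. If you get stuck on a particular step, you can then ask a better, more focused question here. – chepner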

+0

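@chepner: I think that if I manage to get each group of lines into a variable and use that variable with bash commands, I could run normal bash operations on it (quite a lot of them) to extract the information I want. –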

Answers

2

I think the following bash script will work fine, but you should benchmark it against the size of your logs:

#!/bin/bash
# Reads the log on stdin; $1 is a |-separated list of internal domains.

INTERNAL_DOMAINS="${1:-xyz.com|xyz.net}"

declare -i LINES BYTES VALIDS
LINES=0
BYTES=0
VALIDS=0
STATUS=stopped    # stopped -> started -> valid
while IFS= read -r LINE
do
    if [ "$STATUS" = stopped ]
    then
        # Outside a record: wait for the next #START# marker.
        if [ "${LINE:0:7}" = "#START#" ]
        then
            STATUS=started
            PARA=""
        fi
    else
        if [ "${LINE:0:5}" = "#END#" ]
        then
            # End of record: print its FROM/TO lines if it was flagged.
            if [ "$STATUS" = valid ]
            then
                VALIDS+=1
                echo "$PARA" | grep -Ew "FROM|TO"
                echo -e "$VALIDS matched\t----------------------------------------"
            fi
            STATUS=stopped
        elif (echo "$LINE" | grep -Fq "RCPT TO") && (echo "$LINE" | grep -Eqiv "@($INTERNAL_DOMAINS)")
        then
            # A recipient outside the internal domains flags the record.
            STATUS=valid
            PARA+="$LINE"$'\n'
        else
            PARA+="$LINE"$'\n'
        fi
    fi

    # Progress counters, reported on stderr.
    LINES+=1
    BYTES+=${#LINE}
    BYTES+=1
    echo -en "\rRead: lines: $LINES | kB: $((BYTES/1024)) | matches: $VALIDS " >&2
done

Make the script above executable and run it like this; the progress output goes to stderr:

time ./filter.sh "one.int.com|two.int.com" < sample.log > report.out
+0

This worked for me. Thanks a lot. I must admit I have a long way to go as a programmer or scripter ..... –

2

If there is always a blank line between records, and never a blank line inside a record, use awk's "paragraph mode": set RS to the empty string.

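awk -v RS= '
    # In paragraph mode ^ and $ anchor to the whole record, so line boundaries are matched with \n.
    /(^|\n)[0-9:]*In : MAIL FROM: *<[^<>]*@example\.com> *(\n|$)/ &&
    /(^|\n)[0-9:]*In : RCPT TO: *<[^<>]*@example\.com> *(\n|$)/ { … }
'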
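
To make the paragraph-mode idea concrete, here is a minimal sketch rather than a definitive solution. It assumes the internal domains are xyz.com and xyz.net, and it builds a small temporary sample file with made-up addresses (alice@xyz.com, carol@xyz.com, bob@example.org); none of these names come from the real log. Setting FS to a newline makes every line of a record one field, so each line can be tested on its own.

```shell
#!/bin/sh
# Hypothetical sample: two blank-line-separated records with made-up addresses.
log=$(mktemp)
cat > "$log" <<'EOF'
#START#
03:48:19:799: : MAIL FROM IP=1.2.3.4
03:48:20:848:In : MAIL FROM: <alice@xyz.com>
03:48:20:987:In : RCPT TO: <bob@example.org>
03:48:21:271: : Data Received
#END#

#START#
03:56:20:791: : MAIL FROM IP=2.3.4.5
03:56:20:834:In : MAIL FROM: <alice@xyz.com>
03:56:20:834:In : RCPT TO: <carol@xyz.com>
#END#
EOF

# RS= enables paragraph mode; FS='\n' makes each line of a record one field.
# "internal" is an assumed regex for addresses in the internal domains.
result=$(awk -v RS= -v FS='\n' -v internal='@(xyz[.]com|xyz[.]net)>' '
{
    from_internal = 0; to_external = 0
    for (i = 1; i <= NF; i++) {
        if ($i ~ /In : MAIL FROM:/ && $i ~ internal)  from_internal = 1
        if ($i ~ /In : RCPT TO:/   && $i !~ internal) to_external = 1
    }
    # Keep only internal-to-external records, and only their FROM/TO lines.
    if (from_internal && to_external)
        for (i = 1; i <= NF; i++)
            if ($i ~ /(^|[^A-Za-z])(FROM|TO)([^A-Za-z]|$)/) print $i
}' "$log")

printf '%s\n' "$result"
rm -f "$log"
```

For a real file, the same awk program would take the log name (for example log_20120824.log) in place of "$log". A record counts as internal-to-external here as soon as one recipient is outside the internal domains, which may or may not be the rule you want.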

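If you really need to use the #START# and #END# markers, accumulate data in variables as you go. When you reach #END#, do the processing and reset the variables. If necessary, disable processing until the next #START#.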

BEGIN { in_record = 1; }
/^#START#[ \t]*$/ { in_record = 1; }
!in_record { next; }
# Remember the sender address, stripped of the surrounding angle brackets.
/^[0-9:]*In : MAIL FROM: *<[^<>]*>[ \t]*$/ { from = $0; sub(/.*</, "", from); sub(/>.*/, "", from); }
…
/^#END#[ \t]*$/ {
    # processing goes here
    from = "";
    in_record = 0;
}
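
As one possible way to fill in that skeleton, the sketch below collects the FROM/TO lines of each record in a variable and prints them at #END# only when the sender is internal and at least one recipient is external. The domain list (xyz.com, xyz.net), the variable names (kept, from_internal, to_external) and the sample addresses are all assumptions made for illustration, not taken from the real log.

```shell
#!/bin/sh
# Hypothetical sample: records delimited by #START#/#END#, with noise in between.
log=$(mktemp)
cat > "$log" <<'EOF'
...Waiting for activity on port
#START#
03:48:20:848:In : MAIL FROM: <alice@xyz.com>
03:48:20:987:In : RCPT TO: <bob@example.org>
03:48:21:271: : Data Received
#END#
W A R N I N G
#START#
03:56:20:834:In : MAIL FROM: <dave@example.org>
03:56:20:834:In : RCPT TO: <alice@xyz.com>
#END#
EOF

# "internal" is an assumed regex for addresses in the internal domains.
result=$(awk -v internal='@(xyz[.]com|xyz[.]net)>' '
# A new record starts: reset the per-record state.
/^#START#/ { in_record = 1; kept = ""; from_internal = 0; to_external = 0; next }
# Ignore anything between #END# and the next #START#.
!in_record { next }
/In : MAIL FROM:/ { if ($0 ~ internal)  from_internal = 1 }
/In : RCPT TO:/   { if ($0 !~ internal) to_external = 1 }
# Collect every line that contains FROM or TO as a whole word.
/(^|[^A-Za-z])(FROM|TO)([^A-Za-z]|$)/ { kept = kept $0 "\n" }
/^#END#/ {
    if (from_internal && to_external) printf "%s", kept
    in_record = 0
}' "$log")

printf '%s\n' "$result"
rm -f "$log"
```

Because the state is reset at every #START#, this version does not depend on blank lines between records, at the cost of a few more rules than paragraph mode needs.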
+0

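Thanks for the quick reply :), but I could not work out how to use your first solution, and the second looks very complicated (no offence intended). If my log file is named log_20120824.log, what would the usage be? I tried piping the file into the first solution with the domain name replaced, and nothing happened. I also tried "print $0", and even that printed nothing. –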