awk regex: the difference between using it with or without variables

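I have an awk script that behaves differently depending on where I put the regular expression. As far as I can tell, the program logic is the same in both cases, but the results are not. The script analyzes logs in which every transaction has a unique ID. The log lines look like this: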

timestamp (ID) more info

For example:

2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with real information and a key string that determines the type of thransaction 
2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with other information 
2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with more information 
2014-10-06 05:24:40,035 INFO (4xxbbbbbbbbbbbbbcccb) [somestring] this is a different transaction

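What I want is to process all the log lines of one particular type of transaction, to see how long they take. Each transaction is spread over several log lines and is identified by its unique ID. To know whether a transaction is of the type I want, I have to search for a certain string in the first line of that transaction. The log may also contain lines that do not follow the format above.

What I want:

Determine whether the current line belongs to a transaction (it has an ID).

Check whether the ID is already registered in the accumulating array.

If it is not, check whether it is of the required type: search for a fixed string in the body of the line.

If it is, record the timestamp, and so on.

Here is the code (note that this is a heavily reduced version).

This is the one I want to use: first check whether the line is a transaction line, and only then check whether it is of the right type.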

awk '$4 ~ /^\([:alnum:]/ {
    name=$4; gsub(/[()]|:.*/,"",name); ++matched
    if(!(name in arr)){
        if($0 ~ /transaction type/){ arr[name]=1; print name }
    }
}
END {
    print "Found :"length(arr)
    print "Processed "NR
    print matched" lines matched the filter"
}'

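This script finds only 868 transactions, when there are more than 14k of them. If I change the script to the code below, it does find all 14k transactions, but it only looks at the first line of each of them, so it is of no use to me.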

awk '/transaction type/ {
    name=$4; gsub(/[()]|:.*/,"",name); ++matched
    if(!(name in arr)){ arr[name]=1; print name }
}
END {
    print "Found :"length(arr)
    print "Processed "NR
    print matched" lines matched the filter"
}'

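Thanks in advance.

EDIT

Shame on me. This question had more than one actual problem. The main one was that the regex did not match the right strings. It is true that the ID string and the transaction-type string are on the same line, but on those lines the ID looks like (aaaaaabbbbbcccc:  ), with two spaces at the end. That makes awk parse "(aaaaaaaabbbbcccc:" and ")" as two different fields. I realized this when I ran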

$4 !~ /regex/ { print $4 }

and a large number of valid IDs showed up.
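The field-splitting effect described above can be reproduced with a small sketch. The log line below is made up for illustration (the real lines are not shown in the question); it only assumes that the ID is followed by a colon and two spaces before the closing parenthesis.

```shell
# Hypothetical log line: the ID is followed by ":" and two spaces before ")".
line='2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb:  ) [somestring] body'

# With default field splitting, any run of blanks separates fields,
# so the closing parenthesis ends up in a field of its own.
fields=$(printf '%s\n' "$line" | awk '{ print $4 "|" $5 }')
echo "$fields"
```

Here $4 holds the opening parenthesis, the ID and the colon, while $5 holds only the closing parenthesis, so a test on $4 that expects a complete "(ID)" never sees one.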

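The second problem, which showed up after fixing the regex, had already been solved here by several people. Having the main regex and the first { on separate lines made awk print every record. I figured that out myself, and later the same day I read the solution here. Amazing.

Thank you very much, everyone. I can only accept one answer, but I learned a lot from all of them.

Source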

2014-10-06 Danielo515

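You could consider using logstash with grok and multiline filters for this kind of job. I am quite unsure what your input looks like, since your example shows only one line format. – Tensibai 2014-10-06 12:35:54

Hello. I cannot install any programs beyond those already available. I am only interested in lines that match the format above, so IMO that is not a problem. I don't know what all the other lines look like, but that does not matter at all. – Danielo515 2014-10-06 12:53:57

'/transaction type/' does not match any of your sample input lines. That makes it hard to determine what might be wrong. Can you give us actual log lines and the actual string/regex you are matching? – 2014-10-06 12:57:24

Whitespace matters. This: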

/foo/ { 
    print "found" 
}

means: print "found" every time "foo" is present, whereas this:

/foo/ 
{ 
    print "found" 
}

means: print the current record every time "foo" is present, and print "found" for every single input record. So chances are that when you wrote:

$4 ~ /^\([:alnum:]/ 
{ 
    .... 
}

you actually meant to write:

$4 ~ /^\([:alnum:]/ { 
    .... 
}

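Also, chances are you meant to use the POSIX character class [[:alnum:]] instead of the character set [:alnum:], which describes the set of characters :, a, l, n, u and m: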

$4 ~ /^\([[:alnum:]]/ { 
    .... 
}

If you fix those things and still need help, provide some testable sample input and the expected output, and we can help you more.
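To see the effect of the brace placement for yourself, here is a small sketch. The two-line input is invented for the demonstration; the two awk programs differ only in the newline before the opening brace.

```shell
# Made-up two-record input: one line contains "foo", the other does not.
input=$(printf 'foo\nbar\n')

# Brace on the same line as the pattern: one rule, the action runs on matches only.
same_line=$(printf '%s\n' "$input" | awk '/foo/ { print "found" }')

# Brace on the next line: two rules. /foo/ alone prints the matching record,
# and the pattern-less action runs for every record.
next_line=$(printf '%s\n' "$input" | awk '/foo/
{ print "found" }')

echo "same line: $same_line"
echo "next line:"
echo "$next_line"
```

The first program prints "found" once. The second prints the record "foo" followed by "found" twice, once per input record.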

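Source

2014-10-06 13:24:48

As for the line break, I did think it was just formatting for readability when posting here, but it is worth pointing out. I made the same mistake when expanding my example :) – Tensibai 2014-10-07 11:57:22

It is just a syntax error. When you use a POSIX character class, you must put it inside square brackets: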

[[:alnum:]]

Otherwise [:alnum:] is treated as a bracket expression containing the characters : a l m n u
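A quick way to check the difference, sketched with two made-up IDs (one starting with a digit, one starting with the letter a) and assuming an awk that supports POSIX character classes, such as gawk. Some implementations print a warning about the first regex on stderr, which does not affect the result.

```shell
# Two hypothetical IDs in parentheses, one per line.
ids=$(printf '%s\n' '(4aaaaaaaaabbbbbbcccb)' '(abbbbbbbbbbbbbbbcccb)')

# [:alnum:] alone is a bracket expression for the characters : a l n u m,
# so only an ID whose first character is one of those can match.
wrong=$(printf '%s\n' "$ids" | awk '$1 ~ /^\([:alnum:]/ { n++ } END { print n + 0 }')

# [[:alnum:]] is the character class inside a bracket expression,
# so any letter or digit after the parenthesis matches.
right=$(printf '%s\n' "$ids" | awk '$1 ~ /^\([[:alnum:]]/ { n++ } END { print n + 0 }')

echo "with [:alnum:]:   $wrong"
echo "with [[:alnum:]]: $right"
```

Only the ID starting with a matches the first regex, while both match the second. That would be consistent with a regex of the first form finding only a fraction of the transactions, although the question names other causes as well.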

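Source

2014-10-06 13:16:36

In short, it is the character class. If I understand correctly, you want to get the IDs of transactions of a certain type.

First assumption: the ID and the transaction type are on the same line, so something like this should do (mostly adapted from your code), going by your sample input: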

awk 'BEGIN {
    matched=0 # more for clarity than really needed
}
/\([[:alnum:]]*\).*transaction type/ { # keep only lines with both an ID and the transaction type
    gsub(/[()]/,"",$4) # strip the () around the ID
    ++matched # count every matching line, including repeats of the same ID
    if (!($4 in arr)) { # as in yours: only if the ID is not in the array yet
        arr[$4]=1 # remember the ID so it is not added twice
        print $4 # print the ID (only once, since this is inside the if)
    }
}
END { # nothing changed here, printing the stats...
    print "Found :"length(arr)
    print "Processed "NR
    print matched" lines matched the filter"
}'

The output is:

prompt=> awk 'BEGIN { matched=0};/\([a-z0-9]*\)/{ gsub(/[()]/,"",$4); ++matched; if (!($4 in arr)) { arr[$4]=1; print $4 }}; END { print "Found: "length(arr)"\nProcessed "NR"\n"matched" lines matched the filter" }' awkinput 
4aaaaaaaaabbbbbbcccb 
4xxbbbbbbbbbbbbbcccb 
Found: 2 
Processed 4 
4 lines matched the filter

In the test I left out the transaction type, since I have no idea what it might look like.
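For the original goal (following every line of a transaction once its first line has shown the wanted type), a rough sketch could look like the one below. It is not taken from the thread: the sample lines are adapted from the question with the literal text "transaction type" added as a stand-in key string, and the array names seen and lines are made up for the example.

```shell
# Made-up sample: one wanted transaction (three lines), one other
# transaction, and one line that does not follow the log format.
log='2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] first line with transaction type
2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with other information
2014-10-06 05:24:40,035 INFO (4xxbbbbbbbbbbbbbcccb) [somestring] this is a different transaction
free-form line without an ID
2014-10-06 05:24:40,035 INFO (4aaaaaaaaabbbbbbcccb) [somestring] body with more information'

summary=$(printf '%s\n' "$log" | awk '
$4 ~ /^\([[:alnum:]]+/ {                 # transaction lines only
    id = $4
    gsub(/[()]|:.*/, "", id)             # strip parentheses and any ":..." tail
    if (!(id in seen) && $0 ~ /transaction type/)
        seen[id] = 1                     # register the ID on its first typed line
    if (id in seen)
        lines[id]++                      # count every line of a registered ID
}
END { for (id in lines) print id, lines[id] }')

echo "$summary"
```

The counter is where timestamps or other per-line data could be collected instead. Lines of other transactions and lines without an ID are skipped.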

Source

2014-10-06 13:19:12 Tensibai
