2017-07-27 26 views
1

我有一個大文件(超過1000行),我需要按照一些條件進行排序。 文件包含這樣行:按照數字排序文件後,從特定的字bash

bla bla bla took 536ms. {"uniqueId":"ygfwyagf","duration":536} [] 
bla took 531ms. {"uniqueId":"wdagweg","duration":531} [] 
[2017-07-26 11:34:04.346533] wgwqegwqeg took 47ms. {qwgwqgce":"local","duration":47} [] 
[2017-07-2 [bla] Aocal took 41ms. {"uniagwrqgwqrwqg ation":41} [] 
[2017-07-26 1wergwg local took 39ms. {"uniqueId"wetgwgweqg gg} 

需要通過數字後對它們進行排序與​​awk的「花」 我可以通過對它們進行排序:awk '{for(i=1;i<=NF;i++) if ($i=="took") print $(i+1)}' test | sort -h 但對於輸出,我從全行的需要,剛需排序不丟失任何東西。不幸的是,這個mss不在同一列(很容易)。

如果最好比本地bash解決方案(更快/更簡單/更正確),那麼需要調用另一個解釋器(perl,python等)的解決方案。

+0

使用'{}'按鈕的代碼格式的塊文字內容(即實際的代碼或數據)。你可以使用這個塊上面的標籤'<! - language:none - >來告訴站點不要嘗試語法高亮數據。 –

+0

順便說一句,請*不要*標記一種語言,只是因爲你猜測它可能有幫助。這是惹惱很多人的好方法(訂閱標籤是因爲他們想回答有關該語言的問題)。 –

+0

您的代碼缺失的部分是:刪除「ms」。從下一個單詞;並打印該行:'awk'{for(i = 1; i <= NF; i ++){if($ i ==「took」){n = $(i + 1); sub(/ ms \ ./,「」,n);打印n,$ 0;接下來}}}'文件' - 然後按其他人指出的那樣排序並刪除第一個單詞。 –

回答

3

這樣做的簡單方法是將要搜索的數據提取到列中,對其進行排序,然後刪除另一個管道元素中的該列。

因此,作爲一個直接步驟:

gawk 'match($0, /took ([[:digit:]]+)/, m) {printf("%s\t%s\n", m[1], $0)}' 

這會讓你流的樣子:

536 bla bla bla took 536ms. {"uniqueId":"ygfwyagf","duration":536} [] 
531 bla took 531ms. {"uniqueId":"wdagweg","duration":531} [] 
47 [2017-07-26 11:34:04.346533] wgwqegwqeg took 47ms. {qwgwqgce":"local","duration":47} [] 
41 [2017-07-2 [bla] Aocal took 41ms. {"uniagwrqgwqrwqg ation":41} [] 
39 [2017-07-26 1wergwg local took 39ms. {"uniqueId"wetgwgweqg gg} 

...此時您可以通過它通過sort -n排序上的數字在開始處,然後到一個管道元素,該管道元素除去該主導值:

gawk 'match($0, /took ([[:digit:]]+)/, m) {printf("%s\t%s\n", m[1], $0)}' \ 
| sort -n | cut -d $'\t' -f 2- 

...我們有我們的輸出:

[2017-07-26 1wergwg local took 39ms. {"uniqueId"wetgwgweqg gg} 
[2017-07-2 [bla] Aocal took 41ms. {"uniagwrqgwqrwqg ation":41} [] 
[2017-07-26 11:34:04.346533] wgwqegwqeg took 47ms. {qwgwqgce":"local","duration":47} [] 
bla took 531ms. {"uniqueId":"wdagweg","duration":531} [] 
bla bla bla took 536ms. {"uniqueId":"ygfwyagf","duration":536} [] 
+2

也被稱爲[Schwartzian變換](https://en.wikipedia.org/wiki/Schwartzian_transform)。 – chepner

+0

非常感謝。這是工作輝煌 –

+0

@glennjackman,這是在可讀性和效率的基礎上的重大改進;採納,非常感謝。 –

1
gawk '/took/ { cnt++;match($0,/[[:digit:]]+ms/);num=substr($0,RSTART,RLENGTH-2);tim[num]=$0 } END { asorti(tim,tim1);for (i=1;i<=cnt;i++) print tim[tim1[i]] }' filename 

作爲替代,使用GAWK更簡潔的方法是使用時間戳作爲一個數組(TIM)的索引,然後使用功能asorti將索引排序到另一個數組中(tim1),然後使用tim1中的排序索引來提取tim中的數據。

輸出:

[2017-07-26 1wergwg local took 39ms. {"uniqueId"wetgwgweqg gg} 
[2017-07-2 [bla] Aocal took 41ms. {"uniagwrqgwqrwqg ation":41} [] 
[2017-07-26 11:34:04.346533] wgwqegwqeg took 47ms. 
{qwgwqgce":"local","duration" :47} [] 
bla took 531ms. {"uniqueId":"wdagweg","duration":531} [] 
bla bla bla took 536ms. {"uniqueId":"ygfwyagf","duration":536} [] 
+0

我在這裏可以看到的唯一缺點是它需要數據放在RAM中,而在處理大於可用RAM的數據時,GNU'sort'可以將其緩存到單獨排序的臨時文件(然後執行合併排序以合併它們) 。 –

2

用Perl,可以編寫

perl -e ' 
    while (<>) { 
     if (/took (\d+)/) { 
      push @{$lines{$1}}, $_; 
     } 
    } 
    for $num (sort {$a <=> $b} keys %lines) { 
     print join("", @{$lines{$num}}); 
    } 
' file 

,或者如線噪聲

perl -lnE'/took (\d+)/&&[email protected]{$l{$1}},$_}END{[email protected]{$l{$_}}for sort{$a<=>$b}keys%l' file 
+0

謝謝,它的工作也很完美 –