2012-09-30 116 views
2

Awk - How can I improve this regular expression?

I have a file:

@Book{gjn2011ske, 
    author = {Grzegorz J. Nalepa}, 
    title = {Semantic Knowledge Engineering. A Rule-Based Approach}, 
    publisher = {Wydawnictwa AGH}, 
    year =  2011, 
    address = {Krak\'ow} 
} 

@article{gjn2010jucs, 
    Author = {Grzegorz J. Nalepa}, 
    Journal = {Journal of Universal Computer Science}, 
    Number = 7, 
    Pages = {1006-1023}, 
    Title = {Collective Knowledge Engineering with Semantic Wikis}, 
    Volume = 16, 
    Year =  2010 
} 

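I want to improve the regular expression so that it deletes only the first line of each record. Note: the record separator RS="}\n" must not be changed.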

I tried:

awk 'BEGIN{ RS="}\n" } {gsub(/@.*,/,"") ; print }' file 

The output I want:

author = {Grzegorz J. Nalepa}, 
    title = {Semantic Knowledge Engineering. A Rule-Based Approach}, 
    publisher = {Wydawnictwa AGH}, 
    year =  2011, 
    address = {Krak\'ow} 

    Author = {Grzegorz J. Nalepa}, 
    Journal = {Journal of Universal Computer Science}, 
    Number = 7, 
    Pages = {1006-1023}, 
    Title = {Collective Knowledge Engineering with Semantic Wikis}, 
    Volume = 16, 
    Year =  2010 

Thank you for your help.

Edit:

My proposed solution:

awk 'BEGIN{ RS="}\n" }{sub(",","@"); sub(/@.*@/,""); print }' file 

Answers

2

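It is hard to do what you want with the specified RS setting, because the line address = {Krak\'ow} also ends in "}" followed by a newline and therefore produces an extra record. I would rather go with: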

awk '$0 !~ "^@" && $0 !~ "^} *$" { print }' FILE 

See it in action here.
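To make the record-splitting problem concrete, here is a minimal sketch that prints every record awk sees with RS="}\n". It assumes gawk or mawk (POSIX leaves a multi-character RS unspecified), and sample.bib and show_records are hypothetical names, with a shortened two-entry file modelled on the question's input.

```shell
# Assumes gawk or mawk; POSIX leaves a multi-character RS unspecified.
# sample.bib is a hypothetical two-entry file modelled on the question's input.
printf '%s\n' \
  '@Book{k1,' \
  '    year = 2011,' \
  '    address = {Krakow}' \
  '}' \
  '' \
  '@article{k2,' \
  '    Volume = 16,' \
  '    Year = 2010' \
  '}' > sample.bib

# Print every record between brackets so the record boundaries are visible.
show_records() {
    awk 'BEGIN { RS = "}\n" } { printf "record %d: [%s]\n", NR, $0 }' "$1"
}

show_records sample.bib
```

The first record stops right after "{Krakow", so the closing brace of the address value is consumed as a separator. That is why a pure substitution cannot give back a well-formed address line.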

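Edit: I don't see why it has to be a regex solution. Could you explain?

Anyway, here is another solution (working, see here) that does use regular expressions, though not the ones you were expecting: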

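awk 'BEGIN{ RS="}\n" } 
{ 
    n = split($0,a,"\n")    # split() returns the number of lines in the record
    for (e=1;e<=n;e++) { 
     # restore the closing brace that RS stripped from the last line
     if (a[e] ~ "{" && a[e] !~ "}") { 
      sub("$","}",a[e]) 
     } 
     # keep only "key = value" lines; this drops the @... header line
     if (a[e] ~ "=") { print a[e] } 
    } 
    printf("\n") 
}' INPUTFILE 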

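One more, with a simpler regex. It fails on the "address" line, though: the closing } of that line is removed together with your RS, and the final } of the entry would be shown instead ...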

awk 'BEGIN{ RS="}\n" } 
{ 
    sub(/@[^,]+,/,"") 
    print $0 
}' INPUTFILE 
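If a single substitution is really required, one possible sketch is to match from the "@" to the end of that line, newline included, so the header line disappears completely instead of leaving an empty line behind. This assumes gawk or mawk (multi-character RS) and that no "@" occurs before the header inside a record. refs.bib and drop_header are hypothetical names.

```shell
# Hypothetical sample file modelled on the question's input.
printf '%s\n' \
  '@Book{k1,' \
  '    year = 2011,' \
  '    address = {Krakow}' \
  '}' \
  '' \
  '@article{k2,' \
  '    Volume = 16,' \
  '    Year = 2010' \
  '}' > refs.bib

# Remove the first "@..." line of each record, keeping RS="}\n" as requested.
drop_header() {
    awk 'BEGIN { RS = "}\n" } { sub(/@[^\n]*\n/, ""); print }' "$1"
}

drop_header refs.bib
```

The address line still comes out without its closing brace, because that brace is part of the record separator; the regex only takes care of the header line.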
+0

Thanks for your solution, but I am waiting for something like a single regular expression. See my edit and proposed solution. – Tedee12345

+0

Added more solutions. –

+0

Thanks again for your reply. The first solution you proposed works for me. – Tedee12345

2

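One way without using a regular expression: set the field separator to a newline, so that each key of an entry becomes a field. Then loop over the fields and print those that do not start with @: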

awk ' 
    BEGIN { 
     RS="}\n"; 
     FS=OFS="\n"; 
    } 
    { 
     for (i=1; i<=NF; i++) { 
      if (substr($i, 1, 1) != "@") { 
       printf "%s%s", $i, (i == NF) ? RS : OFS; 
      } 
     } 
    } 
' file 

Output:

author = {Grzegorz J. Nalepa}, 
title = {Semantic Knowledge Engineering. A Rule-Based Approach}, 
publisher = {Wydawnictwa AGH}, 
year =  2011, 
address = {Krak\'ow} 

Author = {Grzegorz J. Nalepa}, 
Journal = {Journal of Universal Computer Science}, 
Number = 7, 
Pages = {1006-1023}, 
Title = {Collective Knowledge Engineering with Semantic Wikis}, 
Volume = 16, 
Year =  2010 
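A possible variant of the same field-based idea is sketched below. It skips empty fields and puts a brace back only on a line that has "{" without "}", instead of printing RS after the last field. It assumes gawk or mawk; entries.bib and strip_entries are hypothetical names, and the sample file is a shortened stand-in for the question's input.

```shell
# Hypothetical sample file modelled on the question's input.
printf '%s\n' \
  '@Book{k1,' \
  '    year = 2011,' \
  '    address = {Krakow}' \
  '}' \
  '' \
  '@article{k2,' \
  '    Volume = 16,' \
  '    Year = 2010' \
  '}' > entries.bib

strip_entries() {
    awk '
        BEGIN { RS = "}\n"; FS = "\n" }
        {
            for (i = 1; i <= NF; i++) {
                line = $i
                # skip empty fields and the "@type{key," header line
                if (line == "" || substr(line, 1, 1) == "@") continue
                # a line with "{" but no "}" lost its brace to RS: put it back
                if (index(line, "{") && !index(line, "}")) line = line "}"
                print line
            }
            if (NF) print ""   # blank line after each non-empty record
        }
    ' "$1"
}

strip_entries entries.bib
```

The brace check is a heuristic: it assumes each value sits on one line and that a missing "}" can only be caused by the record separator.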
+0

Thanks for your solution. Your example leaves a "}" at the end of the last line. See my edit and proposed solution. – Tedee12345

2

I would do this with GNU sed:

sed '/^@/,/^}$/ { //d }' file.txt 

Result:

author = {Grzegorz J. Nalepa}, 
    title = {Semantic Knowledge Engineering. A Rule-Based Approach}, 
    publisher = {Wydawnictwa AGH}, 
    year =  2011, 
    address = {Krak\'ow} 

    Author = {Grzegorz J. Nalepa}, 
    Journal = {Journal of Universal Computer Science}, 
    Number = 7, 
    Pages = {1006-1023}, 
    Title = {Collective Knowledge Engineering with Semantic Wikis}, 
    Volume = 16, 
    Year =  2010 

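Note that you can use the -i flag to make the changes in place (i.e., overwrite the file contents), and the -s flag to treat multiple files as separate inputs rather than one continuous stream. For example: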

sed -s -i '/^@/,/^}$/ { //d }' *.txt 
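Since -i overwrites the files, a cautious variant is to give it a backup suffix. The sketch below assumes GNU sed; the demo directory and its two files are hypothetical and exist only to show the effect.

```shell
# Assumes GNU sed; demo/ and its contents are hypothetical sample data.
mkdir -p demo
printf '%s\n' '@Book{k1,' '    year = 2011' '}' > demo/a.txt
printf '%s\n' '@article{k2,' '    Year = 2010' '}' > demo/b.txt

# -i.bak edits each file in place and keeps the original as NAME.bak
sed -s -i.bak '/^@/,/^}$/ { //d }' demo/*.txt

cat demo/a.txt        # edited file: only the field line is left
cat demo/a.txt.bak    # untouched original
```

With GNU sed, -i already implies -s, so writing both flags is harmless but not strictly necessary.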
+0

Thanks for your solution, but I am still waiting for something like a single regular expression. See my edit and proposed solution. – Tedee12345

+0

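@Tedee12345: Not being able to change awk's record separator creates more problems than it solves, and coding around those problems is never a good idea. You should consider posting why you think keeping RS="}\n" is a good idea. If you do, please include more sample data. Good luck. – Steve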

+0

Thanks again for the explanation. – Tedee12345

1
awk '{if($0!~/@/&&$0!~/^}/)print}' temp 

Tested as follows:

> awk '{if($0!~/@/&&$0!~/^}/)print}' temp 
    author =  {Grzegorz J. Nalepa}, 
    title =  {Semantic Knowledge Engineering. A Rule-Based Approach}, 
    publisher = {Wydawnictwa AGH}, 
    year =   2011, 
    address =  {Krak\'ow} 

    Author =  {Grzegorz J. Nalepa}, 
    Journal =  {Journal of Universal Computer Science}, 
    Number =  7, 
    Pages =  {1006-1023}, 
    Title =  {Collective Knowledge Engineering with Semantic Wikis}, 
    Volume =  16, 
    Year =   2010 
> 
+0

This answer is almost identical to Zolts' answer from 20 hours ago. You should consider upvoting that one, as I have. – Steve

+0

Thanks for providing an example solution. – Tedee12345