2017-08-25 43 views
0

Replacing special characters and extracting strings with awk/gsub

I have a file containing many lines like the ones below:

<li><img src="img/tt_potato-30x30.png" alt="ew_inactive"> <img src="img/in-event-40x40.png" alt="event"> - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html">XX:The quick brown fox jumped over the lazy </a> -<img src= "img/config-40x40.png" alt="config"><img src="img/validate-40x50.png" alt="validate"> - user 

<li><img src="img/tt_potato-30x30.png" alt="ew_inactive"> <img src="img/in-event-40x40.png" alt="event"> - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html">YY:Jack and Jill went up the hill </a> -<img src= "img/config-40x40.png" alt="config"><img src="img/validate-40x50.png" alt="validate"> - user 

<li><img src="img/tt_potato-30x30.png" alt="ew_inactive"> <img src="img/in-event-40x40.png" alt="event"> - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html">ZZ: Mary had a little lamb </a> -<img src= "img/config-40x40.png" alt="config"><img src="img/validate-40x50.png" alt="validate"> - user 

I want to extract the following strings and discard everything else.

XX: The quick brown fox jumped over the lazy 
YY: Jack and Jill went up the hill 
ZZ: Mary had a little lamb 

So far I have tried the following awk command, but it only handles XX; the pattern would have to be changed again for YY and ZZ.

awk '{gsub(/^.*XX:/,"XX:"); gsub(/[<\a>].*$/,"[</a>].");print}' 

Can anyone suggest a way to do this with any other standard Linux tool? Thanks.

+0

How generic are XX/YY/ZZ? If they are, you can use '[XYZ]{2}' in most regex engines. – stevesliva

+0

@stevesliva, I think the problem is more (or also) that the OP has to change the replacement string as well as which letters the regex matches. – jas

+0

Hi, Jas is correct: the string before the ':' varies, so handling that is a requirement. Thanks for your replies – niknak

Answers

0

^.XX means "any character followed by XX at the start of a line"; it will not match XX in the middle of a line. [<\a>] means "any one of the characters <, \, a, or >"; it will not match the string <\a>. Find a regex tutorial...
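
To illustrate the difference between a bracket expression and a literal string, here is a minimal sketch. The sample line is a shortened, made-up version of the lines in the question, and the regexes are passed as strings so that the "/" inside the brackets does not depend on how a particular awk parses regex literals.

```shell
line='<a href "tcc_1111.html">XX:The quick brown fox </a> - user'

# Bracket expression: matches ONE character out of <, /, a, >.
# The leftmost match is the very first "<", so the whole line is removed.
class_result=$(printf '%s\n' "$line" | awk '{ sub("[</a>].*$", ""); print }')

# Literal string: matches the four characters "</a>" in sequence,
# so only the closing tag and what follows it are removed.
literal_result=$(printf '%s\n' "$line" | awk '{ sub("</a>.*$", ""); print }')

printf 'bracket: [%s]\n' "$class_result"
printf 'literal: [%s]\n' "$literal_result"
```

The first result is empty, while the second keeps everything up to the closing tag.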

Your question isn't clear, but maybe this is what you're trying to do?

$ awk '{sub(/<\/a>.*/,""); sub(/.*>/,"")} NF' file 
XX:The quick brown fox jumped over the lazy 
YY:Jack and Jill went up the hill 
ZZ: Mary had a little lamb 
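
The two sub() calls above can be followed step by step with the sketch below. It runs each substitution separately on one shortened, made-up sample line, and it assumes the link text itself never contains a ">".

```shell
line='<li><img src="img/a.png" alt="x"> - <a href "tcc_1111.html">YY:Jack and Jill went up the hill </a> -<img src="img/b.png" alt="y"> - user'

# Step 1: delete from the closing </a> to the end of the line.
step1=$(printf '%s\n' "$line" | awk '{ sub(/<\/a>.*/, ""); print }')

# Step 2: delete everything up to and including the last ">" still present,
# which is now the one that closes the <a ...> tag.
step2=$(printf '%s\n' "$step1" | awk '{ sub(/.*>/, ""); print }')

printf 'after step 1: %s\n' "$step1"
printf 'after step 2: %s\n' "$step2"
```

Because neither pattern mentions XX, YY or ZZ, the prefix before the colon can be anything.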

Or, using GNU awk's third argument to match(), print whatever is between <a ...> and </a> (assuming one such pair per line):

$ awk 'match($0,/.*<a[^>]*>(.*)<\/a>.*/,a){print a[1]}' file 
XX:The quick brown fox jumped over the lazy 
YY:Jack and Jill went up the hill 
ZZ: Mary had a little lamb 

The same thing in any sed is:

$ sed -n 's/.*<a[^>]*>\(.*\)<\/a>.*/\1/p' file 
XX:The quick brown fox jumped over the lazy 
YY:Jack and Jill went up the hill 
ZZ: Mary had a little lamb 
0

I guess this Perl one-liner will do (it looks like you are on Linux):

perl -lne 'print $1 if m{>((XX|YY|ZZ):[^<]*)}' 
+0

Thanks for all the replies. I will try all of them and post an update. Many thanks – niknak

1

If your Input_file is the same as the sample shown, then the following may also help you.

awk -F"\">|</a>" 'NF{print $4}' Input_file 

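Explanation: this makes "> and </a> the field separators (which gets exactly what the OP needs :)). The NF condition makes sure empty lines are skipped. With these two separators set, the 4th field is the one the OP asked for. The command below prints every field with its number, which shows why the 4th column is the one to pick.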

awk -F"\">|</a>" '{for(i=1;i<=NF;i++){print i,$i}}' Input_file 
1 <li><img src="img/tt_potato-30x30.png" alt="ew_inactive 
2 <img src="img/in-event-40x40.png" alt="event 
3 - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html 
4 XX:The quick brown fox jumped over the lazy 
5 -<img src= "img/config-40x40.png" alt="config 
6 <img src="img/validate-40x50.png" alt="validate 
7 - user 
1 <li><img src="img/tt_potato-30x30.png" alt="ew_inactive 
2 <img src="img/in-event-40x40.png" alt="event 
3 - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html 
4 YY:Jack and Jill went up the hill 
5 -<img src= "img/config-40x40.png" alt="config 
6 <img src="img/validate-40x50.png" alt="validate 
7 - user 
1 <li><img src="img/tt_potato-30x30.png" alt="ew_inactive 
2 <img src="img/in-event-40x40.png" alt="event 
3 - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html 
4 ZZ: Mary had a little lamb 
5 -<img src= "img/config-40x40.png" alt="config 
6 <img src="img/validate-40x50.png" alt="validate 
7 - user 

I hope this helps.

+3

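In shell, you should use single quotes around strings unless you need double quotes (e.g. to let a variable expand). If you follow that rule, it has the nice side benefit that, in this case, you do not need to escape the double quote when setting FS: instead of -F"\">|</a>" you should write -F'">|</a>'. –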
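
A quick way to check that the two quoting styles give awk the same field separator is sketched below. The sample line is a shortened, made-up version of the lines in the question, so the wanted text is field 3 here rather than field 4.

```shell
line='<li><img src="a.png" alt="x"> - <a href "t.html">XX:The quick brown fox </a> - user'

# Double-quoted: the inner double quote has to be escaped for the shell.
double=$(printf '%s\n' "$line" | awk -F"\">|</a>" '{ print $3 }')

# Single-quoted: nothing needs escaping; awk receives the identical string ">|</a>.
single=$(printf '%s\n' "$line" | awk -F'">|</a>' '{ print $3 }')

printf 'double: [%s]\n' "$double"
printf 'single: [%s]\n' "$single"
```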