sed: replace spaces inside quotes with underscores

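I have some input (for example, from ifconfig run0 scan on OpenBSD) in which the fields are separated by spaces, but some fields themselves contain spaces (luckily, fields that contain spaces are always enclosed in quotes).

I need to tell the spaces inside quotes apart from the separator spaces. The idea is to replace the spaces inside quotes with underscores.

Sample data: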

%cat /tmp/ifconfig_scan | fgrep nwid | cut -f3 
nwid Websense chan 6 bssid 00:22:7f:xx:xx:xx 59dB 54M short_preamble,short_slottime 
nwid ZyXEL chan 8 bssid cc:5d:4e:xx:xx:xx 5dB 54M privacy,short_slottime 
nwid "myTouch 4G Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M privacy,short_slottime

This does not end up processed the way I want, because I have not yet replaced the spaces inside quotes with underscores:

%cat /tmp/ifconfig_scan | fgrep nwid | cut -f3 |\
    cut -s -d ' ' -f 2,4,6,7,8 | sort -n -k4 
"myTouch Hotspot" 11 bssid d8:b3:77:xx:xx:xx 
ZyXEL 8 cc:5d:4e:xx:xx:xx 5dB 54M 
Websense 6 00:22:7f:xx:xx:xx 59dB 54M

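Source

2013-02-16 cnst

Try awk; it may be your solution rather than sed. http://stackoverflow.com/questions/3458699/how-to-use-awk-to-extract-a-quoted-field – 2013-02-16 23:10:51

Yes, I think I will have to use `awk`. But I would still like to replace the spaces inside quotes with underscores as part of the final processing. – cnst 2013-02-16 23:18:24

Check the sub part of this: http://www.staff.science.uu.nl/~oostr102/docs/nawk/nawk_92.html You can combine the two links I gave you to solve your problem. – 2013-02-16 23:27:52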

For a sed-only solution (which I do not necessarily advocate), try:

echo 'a b "c d e" f g "h i"' |\
sed ':a;s/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' 
a b "c_d_e" f g "h_i"

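Translation:

Start at the beginning of the line.
Look for the pattern junk"junk", repeated zero or more times, where junk contains no quotes, followed by junk"junk space.
Replace the final space with _.
If successful, jump back to the beginning.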
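As a quick check of the steps above, here is a minimal sketch that runs the same substitution on the hotspot line from the question. It uses the `-e :a -e '...'` spelling of the command, which non-GNU seds tend to accept; the variable names `line` and `result` are only illustrative.

```shell
# Sample line taken from the question; only the quoted nwid contains spaces.
line='nwid "myTouch 4G Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M privacy,short_slottime'

# :a defines a label; ta jumps back to it while the substitution keeps succeeding.
result=$(printf '%s\n' "$line" |
  sed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta')

printf '%s\n' "$result"
# nwid "myTouch_4G_Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M privacy,short_slottime
```

The separator spaces are left alone, so a later `cut -d ' '` sees the quoted name as a single field.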

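Source

2013-02-17 01:57:36

It actually works! :-) Even on OpenBSD 4.6, with an old sed that has no `-E` option. But why do you have to escape the parentheses? (I tried replacing `\(` with `(`, and it stopped working.) Also, why do you not have to include a space in the second `[]`, i.e. `"[^" ]*"` instead of `"[^"]*"`? How does it know not to be greedy? Other than that, the regular expression itself makes a lot of sense! :) So `:a` is the label `a`, and `ta` jumps to `a`? And jumping means going back over the line with the search/replace already applied? Nifty! I will have to put this in my library. :-) – cnst 2013-02-17 04:22:22

@cnst The substitution works backwards. To see the individual steps (GNU sed), put the command `l0` after the substitution command, i.e. `:a;s/.../.../;l0;ta` – potong 2013-02-17 13:32:55

@potong, great! The `l0` option does not work in my `sed`, but plain `l`, as in `;l;ta`, seems to work fine, and it does show that the processing is greedy and goes backwards. In that case, would it be better to exclude the space in the bracket expression so that it is not greedy? – cnst 2013-02-17 17:50:30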
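To watch the loop work backwards, one option (a sketch, assuming a sed whose `l` command takes no argument, as in the comment above) is to run with `-n` and print the pattern space after every substitution attempt. The variable name `steps` is only illustrative.

```shell
# -n suppresses the normal output; l prints the pattern space, ending it with $.
steps=$(printf '%s\n' 'a b "c d e" f' |
  sed -n -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/' -e l -e ta)

printf '%s\n' "$steps"
# a b "c d_e" f$
# a b "c_d_e" f$
# a b "c_d_e" f$
```

The first line shows that the last space inside the quotes is replaced first, because `[^"]*` matches as much as it can before the space. The final line is a repeat: the last attempt finds nothing left to replace, so `ta` does not branch and the loop ends.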

Try this:

awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\"" file

It works with multiple quoted parts on one line:

echo '"first part" foo "2nd part" bar "the 3rd part comes" baz'| awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\"" 
"first_part" foo "2nd_part" bar "the_3rd_part_comes" baz

Edit: an alternative form:

awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' file
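Why only the even-numbered fields are touched: with `"` as the field separator, field 1 is the text before the first quote, field 2 is the first quoted string, field 3 is the text between quoted strings, and so on. The sketch below feeds the result into the `cut` from the question, using the hotspot line from the sample data as assumed input; the variable name `out` is only illustrative.

```shell
out=$(printf '%s\n' 'nwid "myTouch 4G Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M privacy,short_slottime' |
  awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' |
  cut -s -d ' ' -f 2,4,6,7,8)

printf '%s\n' "$out"
# "myTouch_4G_Hotspot" 11 d8:b3:77:xx:xx:xx 49dB 54M
```

With the spaces inside the quotes gone, the field numbers line up with those of the unquoted network names.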

Source

2013-02-16 23:35:57 Kent

Hm, this does not work in my tcsh: `cat /tmp/ifconfig_scan | fgrep nwid | cut -f3 | awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\"" | cut -s -d ' ' -f 2,4,6,7,8 | sort -n -k4` returns `Unmatched '.` – cnst 2013-02-16 23:40:50

OK, this one works fine in tcsh (I just changed some `"` to `'`): `cat /tmp/ifconfig_scan | fgrep nwid | cut -f3 | awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS='"' | cut -s -d ' ' -f 2,4,6,7,8 | sort -n -k4` – cnst 2013-02-16 23:45:52

@cnst Glad to help. – Kent 2013-02-16 23:48:07

You would be better off using perl. The code is more readable and maintainable:

perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge'

With your input, the result is:

a b "c_d_e" f g "h_i"

Explanation:

-p   # enable printing 
-e   # the following expression... 

s    # begin a substitution 

:    # the first substitution delimiter 

"[^"]*"  # match a double quote followed by anything not a double quote any 
       # number of times followed by a double quote 

:    # the second substitution delimiter 

($x=$&)=~s/ /_/g;  # copy the pattern match ($&) into a variable ($x), then
         # replace every space in $x with an underscore. The
         # variable $x is needed because capture groups and
         # patterns are read-only variables.

$x   # return $x as the replacement. 

:    # the last delimiter 

g    # apply the outer substitution to every quoted string on the line
e    # make sure that the replacement is handled as an expression

Some tests:

for i in {1..500000}; do echo 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' >> test; done 

time perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge' test >/dev/null 

real 0m8.301s 
user 0m8.273s 
sys  0m0.020s 

time awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null 

real 0m4.967s 
user 0m4.924s 
sys  0m0.036s 

time awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null 

real 0m4.336s 
user 0m4.244s 
sys  0m0.056s 

time sed ':a;s/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test >/dev/null 

real 2m26.101s 
user 2m25.925s 
sys  0m0.100s
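The `{1..500000}` brace range used to build the test file is a bash/zsh feature. A minimal alternative sketch for a plain POSIX shell, assuming `yes` and `head` are available (the variable `n` is only illustrative):

```shell
n=500000
# Repeat the sample line n times; yes stops once head closes the pipe.
yes 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' |
  head -n "$n" > test

# Confirm the number of lines written.
wc -l < test
```

Unlike the `>>` loop, this overwrites `test` rather than appending to it, so rerunning it does not grow the file.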

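Source

2013-02-17 08:00:41 Steve

Sorry, but I do not agree that this code is more readable than anything. Others obviously disagree, but to me at least it is completely incomprehensible, and I honestly tried to figure it out. Would you mind adding an explanation of what it is doing? – 2013-02-17 10:48:57

Thanks, that helps a lot. The only things I still do not understand are: 1) what does `=~` mean, as opposed to just `=`, 2) what does "make sure that the replacement is handled as an expression" mean, and 3) when you say "return $x as the replacement", return it to what (is it like assigning to $0 in awk)? – 2013-02-17 12:07:39

@EdMorton: No problem. Glad I could help. 1) `=~` means "run this variable against this regex". 2) Perl's `e` flag is like sed's `e` flag. In the parent substitution, the replacement value is a second (child) substitution. By default, Perl does not expect this, so the `e` flag is required. – Steve 2013-02-17 12:12:48

Not an answer; I am just posting awk-equivalent code for @steve's Perl code in case anyone is interested (and to help me remember this in future):

@steve posted: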

perl -pe 's:"[^\"]*":($x=$&)=~s/ /_/g;$x:ge'

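and, from reading @steve's explanation, the briefest awk equivalent of that Perl code (NOT the preferred awk solution; see @Kent's answer for that) would be this GNU awk: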

gawk '{ 
    head = "" 
    while (match($0,"\"[^\"]*\"")) { 
     head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH)) 
     $0 = substr($0,RSTART+RLENGTH) 
    } 
    print head $0 
}'

which we get to by starting with a POSIX awk solution that uses more variables:

awk '{ 
    head = "" 
    tail = $0 
    while (match(tail,"\"[^\"]*\"")) { 
     x = substr(tail,RSTART,RLENGTH) 
     gsub(/ /,"_",x) 
     head = head substr(tail,1,RSTART-1) x 
     tail = substr(tail,RSTART+RLENGTH) 
    } 
    print head tail 
}'

and saving a line with GNU awk's gensub():

gawk '{ 
    head = "" 
    tail = $0 
    while (match(tail,"\"[^\"]*\"")) { 
     x = gensub(/ /,"_","g",substr(tail,RSTART,RLENGTH)) 
     head = head substr(tail,1,RSTART-1) x 
     tail = substr(tail,RSTART+RLENGTH) 
    } 
    print head tail 
}'

and then getting rid of the variable x:

gawk '{ 
    head = "" 
    tail = $0 
    while (match(tail,"\"[^\"]*\"")) { 
     head = head substr(tail,1,RSTART-1) gensub(/ /,"_","g",substr(tail,RSTART,RLENGTH)) 
     tail = substr(tail,RSTART+RLENGTH) 
    } 
    print head tail 
}'

and then getting rid of the variable "tail", if you do not need $0, NF, etc. left intact after the loop:

gawk '{ 
    head = "" 
    while (match($0,"\"[^\"]*\"")) { 
     head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH)) 
     $0 = substr($0,RSTART+RLENGTH) 
    } 
    print head $0 
}'
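If $0 and NF do need to survive, one option is to wrap the POSIX version of the loop in a function. This is only a sketch: `fix_quoted` is a hypothetical helper name, not part of the answer, and the variable name `out` is likewise illustrative.

```shell
out=$(printf '%s\n' 'a b "c d e" f g "h i"' |
awk '
# fix_quoted is a hypothetical helper; head and x are local variables.
function fix_quoted(s,    head, x) {
    head = ""
    while (match(s, "\"[^\"]*\"")) {
        x = substr(s, RSTART, RLENGTH)          # the quoted string
        gsub(/ /, "_", x)                       # underscores inside it
        head = head substr(s, 1, RSTART - 1) x  # text before it, then it
        s = substr(s, RSTART + RLENGTH)         # continue after it
    }
    return head s
}
# $0 is never assigned, so NF still counts the original fields.
{ print fix_quoted($0), NF }')

printf '%s\n' "$out"
# a b "c_d_e" f g "h_i" 9
```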

Source

2013-02-17 12:41:44

Another awk attempt:

awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\"

To remove the quotes:

awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=
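These one-liners set the record separator to `"`, so the input is cut into records at every quote and the even-numbered records are the text inside quotes. A small sketch that prints each record with its number makes this visible; the input is an assumed example, sent without a trailing newline to keep the last record tidy, and the variable name `recs` is only illustrative.

```shell
# Each record is shown as NR:[text]; records 2, 4, ... are inside quotes.
recs=$(printf '%s' 'a b "c d e" f' |
  awk '{printf "%d:[%s]\n", NR, $0}' RS=\")

printf '%s\n' "$recs"
# 1:[a b ]
# 2:[c d e]
# 3:[ f]
```

Two consequences follow from splitting on quotes rather than on lines: the numbering carries on across line breaks, and with `ORS=\"` every record, including the last, is followed by a quote, so the output ends with one extra `"`.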

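Some extra tests with a test file three times the size, further to the tests done earlier by @steve. I had to convert the sed statement a little so that a non-GNU sed could handle it too. I included awk (bwk), gawk3, gawk4 and mawk: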

$ for i in {1..1500000}; do echo 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' ; done > test 
$ time perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge' test >/dev/null 

real 0m27.802s 
user 0m27.588s 
sys 0m0.177s 
$ time awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null 

real 0m6.565s 
user 0m6.500s 
sys 0m0.059s 
$ time gawk3 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null 

real 0m21.486s 
user 0m18.326s 
sys 0m2.658s 
$ time gawk4 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null 

real 0m14.270s 
user 0m14.173s 
sys 0m0.083s 
$ time mawk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null 

real 0m4.251s 
user 0m4.193s 
sys 0m0.053s 
$ time awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null 

real 0m13.229s 
user 0m13.141s 
sys 0m0.075s 
$ time gawk3 '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null 

real 0m33.965s 
user 0m26.822s 
sys 0m7.108s 
$ time gawk4 '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null 

real 0m15.437s 
user 0m15.328s 
sys 0m0.087s 
$ time mawk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null 

real 0m4.002s 
user 0m3.948s 
sys 0m0.051s 
$ time sed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test > /dev/null 

real 5m14.008s 
user 5m13.082s 
sys 0m0.580s 
$ time gsed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test > /dev/null 

real 4m11.026s 
user 4m10.318s 
sys 0m0.463s

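mawk gives the fastest results...

Source

2013-02-18 13:44:56 Scrutinizer

Nice! The shortest solution to this problem, even shorter than @Steve's shortest `perl` snippet (although it is less readable). I need to give up `sed` and learn `awk`! – cnst 2013-02-18 21:00:57

@cnst, thanks. Glad it helped. – Scrutinizer 2013-02-18 21:26:58

Included some extra tests, further to @steve's tests. – Scrutinizer 2013-02-20 14:41:47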
