2013-01-02
0

CSV with optional quotes in gawk

I need to use "FPAT" or gawk's equivalent function "patsplit". But it seems that the version of gawk installed on our CentOS server is 3.1.5.

I tried to update gawk with this command:

yum update gawk; 

and the server displayed: "No Packages marked for Update"

I also tried to reinstall gawk with:

yum install gawk; 

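The server output: "Package gawk-3.1.5-15.el5.x86_64 already installed and latest version"

I need gawk 4.0 or later to use FPAT or patsplit. Why do I need them? I am trying to process a CSV file, and the file has optional quotes and embedded commas.

Example:

A line from the CSV looks like this: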

this,is,a,"csv,with,embedded comma" 

I need the fields to be split like this:

this 

is 

a 

"csv,with,embedded comma" 

Here is the gawk code:

awk 'BEGIN { n = patsplit("this,is,a,\"csv,with,embedded comma\"", a, "([^,]*)|(\"([^\"]|\"\")+\"[^,]*)", seps); for (i = 1; i <= n; i++) print a[i] }'

Can anyone help me with this, please?

Answers

2

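Try using csvquote in the pipeline to make the data easy for awk to interpret. It is a script I wrote that replaces the commas inside quoted fields with non-printing characters and restores them afterwards.

So, if your awk command originally looks like this: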

awk -F, '{print $3 "," $5}' inputfile.csv 

...it can be made to work with CSV quoted delimiters like this:

csvquote inputfile.csv | awk -F, '{print $3 "," $5}' | csvquote -u 

For the code and detailed documentation, see https://github.com/dbro/csvquote
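To illustrate the idea (this is not csvquote's own code), here is a minimal sketch in POSIX awk and tr. It assumes the ASCII unit separator (octal 037) as the stand-in character, that quotes are balanced on each line, and that no field contains a newline. The function names hide_commas and restore_commas are made up for this example.

```shell
# hide_commas (hypothetical helper): replace commas INSIDE double-quoted
# fields with the unit separator (\037, an assumed stand-in character).
# Splitting on the quote character puts quoted text in the even fields.
hide_commas() {
    awk 'BEGIN { FS = OFS = "\"" }
         { for (i = 2; i <= NF; i += 2) gsub(/,/, "\037", $i); print }' "$@"
}

# restore_commas (hypothetical helper): turn the stand-in back into commas.
restore_commas() {
    tr '\037' ','
}

# A plain comma split is now safe between the two steps.
printf '%s\n' 'a,b,"c,d",e' |
    hide_commas |
    awk -F, '{ print $3 "," $4 }' |
    restore_commas
# prints: "c,d",e
```

The real tool covers more cases than this sketch, so treat it only as a picture of the substitute-then-restore pattern.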


Excellent!!! Wow!! –

1

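I think we can use match() to get the fields.

Here is the code:

awk '{
    $0 = $0 ","                            # append a comma so every field ends with one
    while ($0 != "") {
        match($0, / *"[^"]*" *,|[^,]*,/)   # next field: quoted (may hold commas) or plain
        field = substr($0, RSTART, RLENGTH)
        gsub(/,$/, "", field)              # drop the trailing delimiter
        print field
        $0 = substr($0, RLENGTH + 1)       # consume the field and continue
    }
}' file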

Testing with your input example:

kent$ echo 'this,is,a,"csv,with,embedded coma"'|awk '{ 
$0=$0","         
while($0) { 
    match($0,/ *"[^"]*" *,|[^,]*,/) 
    field=substr($0,RSTART,RLENGTH)    
    gsub(/,$/,"",field) 
    print field 
    $0=substr($0,RLENGTH+1)    
}}' 
this 
is 
a 
"csv,with,embedded coma" 

1

The simplest thing to do is to convert the commas outside the quotes to something else before you do your real processing. For example:

$ cat file 
this,is,a,"csv,with,embedded coma",and,here,"is,another",one 
and,here,"is,another,line" 
$ 
$ awk 'BEGIN{FS=OFS="\""}{for (i=1;i<=NF;i+=2) gsub(/,/,";",$i)}1' file 
this;is;a;"csv,with,embedded coma";and;here;"is,another";one 
and;here;"is,another,line" 

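If you don't like ";"s as the field separator, choose something else such as a control character. Here is an example that uses newlines as the FS and blank lines as the RS: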

$ awk 'BEGIN{FS=OFS="\""; ORS="\n\n"}{for (i=1;i<=NF;i+=2) gsub(/,/,"\n",$i)}1' file 
this 
is 
a 
"csv,with,embedded coma" 
and 
here 
"is,another" 
one 

and 
here 
"is,another,line" 

$ awk 'BEGIN{FS=OFS="\""; ORS="\n\n"}{for (i=1;i<=NF;i+=2) gsub(/,/,"\n",$i)}1' file | 
awk -F'\n' -v RS= '{for (i=1;i<=NF;i++) print NR,i,"<" $i ">"}' 
1 1 <this> 
1 2 <is> 
1 3 <a> 
1 4 <"csv,with,embedded coma"> 
1 5 <and> 
1 6 <here> 
1 7 <"is,another"> 
1 8 <one> 
2 1 <and> 
2 2 <here> 
2 3 <"is,another,line"> 
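The control-character option mentioned above could look like the following sketch. It assumes a tab as the replacement delimiter and that the data contains no tabs of its own. The function name unquoted_commas_to_tabs is made up for this example.

```shell
# unquoted_commas_to_tabs (hypothetical helper): convert commas OUTSIDE
# double quotes to tabs. Splitting on the quote character leaves unquoted
# text in the odd fields, so only those are rewritten.
unquoted_commas_to_tabs() {
    awk 'BEGIN { FS = OFS = "\"" }
         { for (i = 1; i <= NF; i += 2) gsub(/,/, "\t", $i); print }' "$@"
}

# The real processing can then split on tabs alone.
printf '%s\n' 'this,is,a,"csv,with,embedded coma",and' |
    unquoted_commas_to_tabs |
    awk -F'\t' '{ print NF ": " $4 }'
# prints: 5: "csv,with,embedded coma"
```

Any other character that cannot occur in the data would work the same way.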

It only gets tricky if you have embedded newlines or embedded escaped double quotes.
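If the quotes inside a field are escaped by doubling them (the RFC 4180 convention, which is an assumption about your data), one option is to match fields one at a time instead of splitting. This is a rough sketch in POSIX awk and does not handle embedded newlines. The function name split_csv_fields is made up for this example.

```shell
# split_csv_fields (hypothetical helper): print one field per line,
# accepting "" as an escaped quote inside a quoted field.
split_csv_fields() {
    awk '{
        rest = $0 ","                 # every field now ends with a comma
        while (rest != "") {
            # Longest match wins: a whole quoted field or a plain one.
            if (!match(rest, /^("([^"]|"")*"|[^,]*),/)) break
            print substr(rest, 1, RLENGTH - 1)
            rest = substr(rest, RLENGTH + 1)
        }
    }' "$@"
}

printf '%s\n' 'a,"b ""c"", d",e' | split_csv_fields
# prints:
# a
# "b ""c"", d"
# e
```

The quotes are left in place, as in the other examples here, so stripping them and collapsing "" into a single quote is still a separate step.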


Ooh, **+1** for substituting in every second field separated by quotes! I'll hang on to that one. :) – ghoti

0

Here is a pure gawk solution:

{ # Split on double quotes to handle lines like "this, this, or this".
    printf("LINE: '%s'\nFIELDS:", $0)
    n = split($0,q,/"/)
    f = 0
}

n == 1 { # If n is 1, there are no double quotes on the line.
    n = split($0,c,/,/)
    for (i = 1; i <= n; i++) {
        printf(" %d='%s'", i, c[i])
    }
    printf("\n")
    next
}

{ # There are "strings"; the EVEN entries in q are the quoted strings.
    for (i = 1; i <= n; i++) {
        if (0 == and(i,1)) { # i is EVEN: This is a double-quoted string.
            printf(" %d='\"%s\"'", ++f, q[i])
            continue
        }
        if (0 == length(q[i])) { # First/last field is a quoted string.
            continue
        }
        if (q[i] == ",") {
            # First/last field empty, or comma between two quoted strings.
            if (i == 1 || i == n) { # First/last field empty
                printf(" %d=''", ++f)
            }
            continue
        }
        # Remove commas before/after a quoted string then split on commas.
        sub(/^,/,"",q[i])
        sub(/,$/,"",q[i])
        m = split(q[i],cq,/,/)
        for (j = 1; j <= m; j++) {
            printf(" %d='%s'", ++f, cq[j])
        }
    }
    printf("\n")
}

With this input:

This is one,23,$9.32,Another string. 
Line 2,234,$88.34,Blah blah 
"This is another",763,$0.00,"trouble, or not?" 
"This is, perhaps, trouble too...",763,$0.00,"trouble, or not?" 
2,"This is, perhaps, trouble too...",763,"trouble, or not?" 
3,,"number, number","well?" 
,,, 
"1,one","2,two","3,three","4,four" 
",commas,","no commas",",,,,,", 
,"Fields 1 and 4 are empty","But 2 and 3 are not", 

it produces this output:

LINE: 'This is one,23,$9.32,Another string.' 
FIELDS: 1='This is one' 2='23' 3='$9.32' 4='Another string.' 
LINE: 'Line 2,234,$88.34,Blah blah' 
FIELDS: 1='Line 2' 2='234' 3='$88.34' 4='Blah blah' 
LINE: '"This is another",763,$0.00,"trouble, or not?"' 
FIELDS: 1='"This is another"' 2='763' 3='$0.00' 4='"trouble, or not?"' 
LINE: '"This is, perhaps, trouble too...",763,$0.00,"trouble, or not?"' 
FIELDS: 1='"This is, perhaps, trouble too..."' 2='763' 3='$0.00' 4='"trouble, or not?"' 
LINE: '2,"This is, perhaps, trouble too...",763,"trouble, or not?"' 
FIELDS: 1='2' 2='"This is, perhaps, trouble too..."' 3='763' 4='"trouble, or not?"' 
LINE: '3,,"number, number","well?"' 
FIELDS: 1='3' 2='' 3='"number, number"' 4='"well?"' 
LINE: ',,,' 
FIELDS: 1='' 2='' 3='' 4='' 
LINE: '"1,one","2,two","3,three","4,four"' 
FIELDS: 1='"1,one"' 2='"2,two"' 3='"3,three"' 4='"4,four"' 
LINE: '",commas,","no commas",",,,,,",' 
FIELDS: 1='",commas,"' 2='"no commas"' 3='",,,,,"' 4='' 
LINE: ',"Fields 1 and 4 are empty","But 2 and 3 are not",' 
FIELDS: 1='' 2='"Fields 1 and 4 are empty"' 3='"But 2 and 3 are not"' 4=''