2013-11-26 46 views
1

I'm facing a problem with the delimiter in my source file. Suppose the file contains the following data:

"dfjsdlfkj,fsdkfj,werkj",234234,234234,,"dfsd,etwetr" 

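Here the delimiter is a comma, but some fields contain commas as part of their data. Those fields are enclosed in double quotes. I want to extract a few columns from the file.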

If I use cut -d "," -f 1,3, I get output like this:

"dfjsdlfkj,werkj" 
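
To see why cut behaves this way, here is a minimal sketch in Python that mimics a quote-unaware split on the sample line. The variable names are only illustrative; the point is that every comma counts as a separator, so the first quoted field is broken into three pieces.

```python
line = '"dfjsdlfkj,fsdkfj,werkj",234234,234234,,"dfsd,etwetr"'

# Split on every comma, the way cut -d "," does; quotes get no special treatment.
pieces = line.split(",")
print(len(pieces))  # 8 pieces, although the record has only 5 logical fields

# Taking pieces 1 and 3 (zero-based 0 and 2) reproduces the output shown above.
naive = ",".join([pieces[0], pieces[2]])
print(naive)
```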
+0

Possible duplicate of "Escaping separator within double quotes" (http://stackoverflow.com/questions/7804673) – pfnuesel

+0

If your first double-quoted part always contains exactly two commas, you could try 'cut -d"," -f1-3,5' – n0741337

Answers

0

You could try the following:

awk -f getFields.awk input.txt 

where input.txt is the input file and getFields.awk is:

{ 
    split("",a) 
    splitLine() 
    print a[1],a[3] 
} 

function splitLine(s,indq,t,r,len) { 
# Assumptions: 
# * spaces before or after commas are ignored 
# * spaces at the beginning or end of a line are ignored 

# definition of a quoted parameter: 
# - starts with: (^ and $ are regexp characters) 
# a) ^" 
# b) ," 
# - ends with: 
# a) "$ 
# b) ", 

    s=$0; k=1 
    s=removeBlanks(s) 
    while (s) { 
     if (substr(s,1,1)=="\"") 
      indq=2 
     else { 
      sub(/[[:blank:]]*,[[:blank:]]*"/,",\"",s) 
      indq=index(s,",\"") 
      if (indq) { 
       t=substr(s,1,indq-1) 
       splitCommaString(t) 
       indq=indq+2 
      } 
     } 
     if (indq) { 
      s=substr(s,indq) 
      sub(/"[[:blank:]]*,/,"\",",s) 
      len=index(s,"\",") #find closing quote 
      if (!len) { 
       if (match(s,/"$/)) { 
        len=RSTART-1 
       } 
       else 
        len=length(s) 
       r=substr(s,1,len) 
       s="" 
      } else { 
       r=substr(s,1,len-1) 
       s=substr(s,len+2) 
      } 
      a[k++]=r 
     } else { 
      splitCommaString(s) 
      s="" 
     } 
    } 
    k=k-1 
} 

function splitCommaString(t,b,i) { 
    n=split(t,b,",") 
    for (i=1; i<=n; i++) 
     a[k++]=removeBlanks(b[i]) 
}  

function removeBlanks(r) { 
    sub(/^[[:blank:]]*/,"",r) 
    sub(/[[:blank:]]*$/,"",r) 
    return r 
} 
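
For comparison, the same idea (a comma only separates fields when it is outside double quotes) can be sketched as a small state machine in Python. This is not a port of the awk script above: the function name split_quoted is made up for illustration, it strips blanks around every field, and it does not handle doubled quotes ("") used as an escape inside a quoted field.

```python
def split_quoted(line, delimiter=",", quote='"'):
    """Split one line on delimiter, ignoring delimiters inside quotes."""
    fields, current, in_quotes = [], [], False
    for ch in line.strip():
        if ch == quote:
            in_quotes = not in_quotes  # toggle state; the quote itself is dropped
        elif ch == delimiter and not in_quotes:
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    fields.append("".join(current).strip())  # last field has no trailing comma
    return fields

line = '"dfjsdlfkj,fsdkfj,werkj",234234,234234,,"dfsd,etwetr" '
fields = split_quoted(line)
print(fields[0], fields[2])
```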
1

I suggest you use a CSV parser. Python, for example, ships one as a built-in module, so you only need to import it:

import sys 
import csv 

with open(sys.argv[1], newline='') as csvfile: 
    csvreader = csv.reader(csvfile) 
    csvwriter = csv.writer(sys.stdout) 
    for row in csvreader: 
        csvwriter.writerow([row[e] for e in (0,2)]) 

Assuming your example line is in an input file named infile, run the script:

python3 script.py infile 

That yields:

"dfjsdlfkj,fsdkfj,werkj",234234
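
If you want to try this without creating a file, the same csv calls can be run against an in-memory string. This is only a sketch: select_columns and its columns parameter are hypothetical names, and it assumes every row has at least as many fields as the largest index requested.

```python
import csv
import io

def select_columns(text, columns=(0, 2)):
    """Return CSV text holding only the given zero-based columns of each row."""
    out = io.StringIO()
    writer = csv.writer(out)
    # newline="" lets the csv module handle line endings itself.
    for row in csv.reader(io.StringIO(text, newline="")):
        writer.writerow([row[i] for i in columns])
    return out.getvalue()

sample = '"dfjsdlfkj,fsdkfj,werkj",234234,234234,,"dfsd,etwetr"\n'
result = select_columns(sample)
print(result, end="")
```

The writer adds the double quotes back around the first field because it contains the delimiter, which is why the output is still valid CSV.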