2013-05-01 61 views
40

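Can I import a CSV file and automatically infer the delimiter?

I want to import two kinds of CSV files: some use ";" as the delimiter and others use ",". So far I have been switching between the following two lines: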

reader=csv.reader(f,delimiter=';') 

reader=csv.reader(f,delimiter=',') 

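Is it possible to leave the delimiter unspecified and have the program work out the correct one?

The solutions below (from Blender and sharth) seem to work for comma-separated files (generated with LibreOffice) but not for semicolon-separated files (generated with MS Office). Here are the first lines of one semicolon-separated file: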

ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes 
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document 
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document 
+0

Hi, there is also a more general discussion (not specific to Python) at https://stackoverflow.com/questions/2789695/ – Lorenzo 2018-01-03 15:56:42

Answers

6

To solve this, I wrote a function that reads the first line of the file (the header) and detects the delimiter.

def detectDelimiter(csvFile):
    with open(csvFile, 'r') as myCsvfile:
        header = myCsvfile.readline()
        if header.find(";") != -1:
            return ";"
        if header.find(",") != -1:
            return ","
    # default delimiter (MS Office export)
    return ";"
+5

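Your function will not work if the delimiter is part of a value, even when it is escaped or quoted. For example, the line "Hi, Peter;","How are you?","Bye John!" would return ; as the delimiter, which is wrong. – tashuhka 2016-10-06 13:23:13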

42

The csv module seems to recommend csv.Sniffer for this problem.

The documentation gives the following example, which I have adapted to your case.

with open('example.csv', 'rb') as csvfile: # python 3: 'r',newline="" 
    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=";,") 
    csvfile.seek(0) 
    reader = csv.reader(csvfile, dialect) 
    # ... process CSV file contents here ... 

Let's try it out.

[9:13am][[email protected] /tmp] cat example 
#!/usr/bin/env python 
import csv 

def parse(filename): 
    with open(filename, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)

        for line in reader:
            print line

def main(): 
    print 'Comma Version:' 
    parse('comma_separated.csv') 

    print 
    print 'Semicolon Version:' 
    parse('semicolon_separated.csv') 

    print 
    print 'An example from the question (kingdom.csv)' 
    parse('kingdom.csv') 

if __name__ == '__main__': 
    main() 

And here are our sample inputs:

[9:13am][[email protected] /tmp] cat comma_separated.csv 
test,box,foo 
round,the,bend 

[9:13am][[email protected] /tmp] cat semicolon_separated.csv 
round;the;bend 
who;are;you 

[9:22am][[email protected] /tmp] cat kingdom.csv 
ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes 
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document 
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document 

If we run the example program:

[9:14am][[email protected] /tmp] ./example 
Comma Version: 
['test', 'box', 'foo'] 
['round', 'the', 'bend'] 

Semicolon Version: 
['round', 'the', 'bend'] 
['who', 'are', 'you'] 

An example from the question (kingdom.csv) 
['ReleveAnnee', 'ReleveMois', 'NoOrdre', 'TitreRMC', 'AdopCSRegleVote', 'AdopCSAbs', 'AdoptCSContre', 'NoCELEX', 'ProposAnnee', 'ProposChrono', 'ProposOrigine', 'NoUniqueAnnee', 'NoUniqueType', 'NoUniqueChrono', 'PropoSplittee', 'Suite2LecturePE', 'Council PATH', 'Notes'] 
['1999', '1', '1', '1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC', 'U', '', '', '31999D0083', '1998', '577', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document'] 
['1999', '1', '2', '1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes', 'U', '', '', '31999D0081', '1998', '184', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document'] 

It may also be worth noting which Python version I am using.

[9:20am][[email protected] /tmp] python -V 
Python 2.7.2 
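
On Python 3 the same idea can be sketched roughly as below. This is only an illustration, not the code that was run above: the name read_rows, its parameters, and the decision to fall back to ";" when the sniffer cannot decide are all assumptions.

```python
import csv


def read_rows(path, candidates=";,", sample_size=1024, fallback=";"):
    """Return all rows of a CSV file, guessing the delimiter with csv.Sniffer.

    read_rows and its parameters are hypothetical names used for illustration.
    """
    # Text mode with newline="" is what the csv docs recommend on Python 3.
    with open(path, "r", newline="") as csvfile:
        sample = csvfile.read(sample_size)
        csvfile.seek(0)
        try:
            dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
            reader = csv.reader(csvfile, dialect)
        except csv.Error:
            # The sniffer could not determine a delimiter; use the assumed default.
            reader = csv.reader(csvfile, delimiter=fallback)
        return list(reader)
```

Catching csv.Error keeps the caller from crashing on files where sniffing fails, at the cost of silently trusting the fallback delimiter, so it may be worth logging when that branch is taken.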
+0

This works for comma-separated files, but semicolon-separated files are not read correctly (could not determine delimiter). See my edit above... – rom 2013-05-01 06:39:41

+0

It seems to work for me. I'll expand the answer. – 2013-05-01 14:05:49

+0

I have included a comma-separated example, a semicolon-separated example, and the sample file you gave in the question. – 2013-05-01 14:25:10

2

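I don't think there can be a completely general solution to this (one of the reasons I might use , as a delimiter is that some of my data fields need to be able to contain ;...). A simple heuristic would be to read the first line (or more), count how many , and ; characters it contains (possibly ignoring those inside quotes, if whatever creates your .csv files quotes entries properly and consistently), and guess that the more frequent of the two is the correct delimiter.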
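
The counting heuristic described above could look something like this. It is a minimal sketch under stated assumptions: the function name guess_delimiter and its parameters are made up for illustration, quoting is assumed to use a single quote character consistently, and ties go to the first candidate.

```python
def guess_delimiter(line, candidates=(",", ";"), quotechar='"'):
    """Guess the delimiter by counting candidate characters outside quotes.

    Hypothetical helper: returns None when no candidate appears unquoted.
    """
    counts = {c: 0 for c in candidates}
    in_quotes = False
    for ch in line:
        if ch == quotechar:
            # A doubled quote ("") toggles twice, leaving the state unchanged.
            in_quotes = not in_quotes
        elif not in_quotes and ch in counts:
            counts[ch] += 1
    # max() returns the first candidate on a tie.
    best = max(candidates, key=lambda c: counts[c])
    return best if counts[best] > 0 else None
```

For a line such as "Hi, Peter;","How are you?","Bye John!" only the two commas between the quoted fields are counted, so the guess is "," rather than ";".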

7

For a project that handles both , (comma) and | (vertical bar) delimited CSV files, all well formed, I tried the following (as given at https://docs.python.org/2/library/csv.html#csv.Sniffer):

dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=',|') 

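However, on a |-delimited file this raised a "Could not determine delimiter" exception. My guess is that the sniffing heuristic works best when every line has the same number of delimiters (not counting anything enclosed in quotes). So, instead of reading the first 1024 bytes of the file, I tried reading the first two lines in their entirety: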

temp_lines = csvfile.readline() + '\n' + csvfile.readline() 
dialect = csv.Sniffer().sniff(temp_lines, delimiters=',|') 

So far, this has been working well for me.
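
Wrapped in a small helper, the whole-lines approach might look like the sketch below. The name sniff_first_lines and its parameters are assumptions for illustration. Since readline() already returns each line with its trailing newline, the lines are simply joined here, and the file is rewound so a reader can start from the top.

```python
import csv


def sniff_first_lines(csvfile, candidates=",|", n_lines=2):
    """Sniff the dialect from the first n_lines of an open text file.

    Hypothetical helper: csvfile must be seekable and opened in text mode.
    """
    # Each readline() result keeps its own line ending, so no separator is added.
    sample = "".join(csvfile.readline() for _ in range(n_lines))
    csvfile.seek(0)  # rewind so the caller can read from the beginning
    return csv.Sniffer().sniff(sample, delimiters=candidates)
```

The returned dialect can then be passed straight to csv.reader(csvfile, dialect). Sniffing whole lines avoids handing the sniffer a sample that was cut off in the middle of a row.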

+2

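This was very helpful to me! I ran into a problem with data where one of the values was a number containing a comma, so it kept failing. Limiting it to the first two lines really helped. – mauve 2016-05-05 18:08:32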

+0

Great, this worked for my |-separated "csv" file. Thanks :) – EisenHeim 2017-02-13 13:11:33

6

If you are using DictReader, you can do this:

#!/usr/bin/env python 
import csv 

def parse(filename): 
    # Python 3: open in text mode with newline=''; on Python 2 use 'rb' instead
    with open(filename, 'r', newline='') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.DictReader(csvfile, dialect=dialect)

        for line in reader:
            print(line['ReleveAnnee'])

I used this with Python 3.5 and it worked this way.

+1

I use it in Python 2.7. – alvaro562003 2017-03-01 10:02:03
