2012-07-18 92 views

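Parsing a CSV file with an unknown delimiter

I have to write (or use an existing) CSV parsing library. The problem is that files are uploaded in different formats, e.g. with different delimiter symbols: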

File1: 
field1; field2; field3; field4 
field1; field2; field3; field4 

File2: 
feld1, field2, field3, field4 
feld1, field2, field3, field4 

File3: 
"field1", "field2", "field3", "field4" 
"field1", "field2", "field3", "field4" 

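What is the best way to programmatically determine which symbol is the actual column delimiter?

I am thinking of writing my own method based on statistical analysis of the symbols, but maybe there is an existing solution?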
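
For illustration, the symbol-statistics idea could be sketched roughly as follows. This is only a sketch under simple assumptions, not an existing library: `DelimiterSniffer` and `Guess` are names made up here, fields are assumed to contain only letters, digits and underscores, and whitespace (including tabs) is never considered a delimiter. The guess is the punctuation character that occurs the same non-zero number of times on every non-blank line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DelimiterSniffer
{
    // Hypothetical helper: returns the punctuation character that appears the same
    // non-zero number of times on every non-blank line, or null if there is none.
    public static char? Guess(IEnumerable<string> lines)
    {
        List<string> rows = lines.Where(l => !string.IsNullOrWhiteSpace(l)).ToList();
        if (rows.Count == 0)
            return null;

        // Candidates: characters of the first row that are not letters, digits,
        // underscores, quotes or whitespace.
        IEnumerable<char> candidates = rows[0]
            .Where(c => !char.IsLetterOrDigit(c) && !char.IsWhiteSpace(c)
                        && c != '_' && c != '"' && c != '\'')
            .Distinct();

        char? best = null;
        int bestCount = 0;
        foreach (char c in candidates)
        {
            int count = rows[0].Count(ch => ch == c);
            bool consistent = rows.All(r => r.Count(ch => ch == c) == count);

            // Prefer the consistent candidate with the most occurrences per line.
            if (consistent && count > bestCount)
            {
                best = c;
                bestCount = count;
            }
        }
        return best;
    }
}
```

With the sample files above, `DelimiterSniffer.Guess(lines)` should yield `;` for File1 and `,` for File2 and File3. It would be fooled by delimiters that also occur inside quoted fields, so it is a starting point rather than a complete answer.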

Answers


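I would use regular expressions (hopefully without getting as many downvotes as last time ;)). I take advantage of backreferences, which basically let you reuse a previously captured group. You can even have different separators within the same file, as long as each line uses the same separator throughout (not sure whether that is useful).

So, this is how I build the regex: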

string csvItem = @"[""']?\w+[""']?"; 
string separator = @"\s*[,\.;-]\s*"; 
string pattern = string.Format(@"^({0}(?<sep>{1}){0})+(\k<sep>{0})*$", 
    csvItem, separator); 

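csvItem is an item (column) in the CSV. It can contain lowercase or uppercase letters, digits and underscores, and can optionally be surrounded by " or '.

separator separates the items. It consists of one of these characters: , . ; - with zero or more whitespace characters on either side.

The pattern says that a valid line consists of at least two csvItems separated by a separator. Note the backreference \k<sep>: it forces every later separator on the line to be identical to the first one captured.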
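
To illustrate the effect of the backreference, here is a minimal sketch that wraps the same pattern in a helper. `SeparatorPattern` and `TryFindSeparator` are names invented for this illustration and are not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SeparatorPattern
{
    // Same building blocks as in the answer.
    private const string CsvItem = @"[""']?\w+[""']?";
    private const string Separator = @"\s*[,\.;-]\s*";

    private static readonly Regex LineRegex = new Regex(
        string.Format(@"^({0}(?<sep>{1}){0})+(\k<sep>{0})*$", CsvItem, Separator));

    // Returns true and the captured separator when the whole line matches;
    // returns false and an empty separator otherwise.
    public static bool TryFindSeparator(string line, out string separator)
    {
        Match match = LineRegex.Match(line);
        separator = match.Success ? match.Groups["sep"].Value : string.Empty;
        return match.Success;
    }
}
```

Under these assumptions, `TryFindSeparator("field1; field2; field3", out sep)` should succeed with `sep` set to "; ", while a line that mixes separators, such as "a; b, c", should be rejected, because \k<sep> only accepts a repeat of the first separator.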

This is the content of the test file:

field1; field2; field3; field4 
field1; field2; field3; field4 

feld1, field2, field3, field4 
feld1, field2, field3, field4 

"field1", "field2", "field3", "field4" 
"field1", "field2", "field3", "field4" 

And a sample console project:

using System;
using System.IO;
using System.Text.RegularExpressions;

namespace csvParser {
    class Program {
        static void Main(string[] args) {
            var lines = File.ReadAllLines(@"e:\prova.csv");

            // The pattern is the same for every line, so build the regex once.
            string csvItem = @"[""']?\w+[""']?";
            string separator = @"\s*[,\.;-]\s*";
            string pattern = string.Format(@"^({0}(?<sep>{1}){0})+(\k<sep>{0})*$", csvItem, separator);
            var rex = new Regex(pattern, RegexOptions.Singleline);

            for (int i = 0; i < lines.Length; i++) {
                var match = rex.Match(lines[i]);

                // Caution: Regex.Match never returns null (a failed match has
                // Success == false), so this branch is never taken as written.
                if (match == null) {
                    Console.WriteLine("No match on line {0}", i);
                    continue;
                }
                else {
                    // The named group holds the separator found on this line.
                    string sep = match.Groups["sep"].Value;

                    Console.WriteLine("--- Line #{0} ---------------", i);
                    Console.WriteLine("Line is '{0}'", lines[i]);
                    Console.WriteLine("Separator is '{0}'", sep);

                    Console.WriteLine("Items are:");
                    foreach (string item in lines[i].Split(sep))
                        Console.WriteLine("\t'{0}'", item);

                    Console.WriteLine();
                }
            }

            Console.ReadKey();
        }
    }

    public static partial class Extension {
        // Splits on a whole string separator and drops empty entries.
        public static string[] Split(this string str, string sep) {
            return str.Split(new string[] { sep }, StringSplitOptions.RemoveEmptyEntries);
        }
    }
}

And finally the output:

--- Line #0 --------------- 
Line is 'field1; field2; field3; field4' 
Separator is '; ' 
Items are: 
     'field1' 
     'field2' 
     'field3' 
     'field4' 

--- Line #1 --------------- 
Line is 'field1; field2; field3; field4' 
Separator is '; ' 
Items are: 
     'field1' 
     'field2' 
     'field3' 
     'field4' 

--- Line #2 --------------- 
Line is '' 
Separator is '' 
Items are: 

--- Line #3 --------------- 
Line is 'feld1, field2, field3, field4' 
Separator is ', ' 
Items are: 
     'feld1' 
     'field2' 
     'field3' 
     'field4' 

--- Line #4 --------------- 
Line is 'feld1, field2, field3, field4' 
Separator is ', ' 
Items are: 
     'feld1' 
     'field2' 
     'field3' 
     'field4' 

--- Line #5 --------------- 
Line is '' 
Separator is '' 
Items are: 

--- Line #6 --------------- 
Line is '"field1", "field2", "field3", "field4"' 
Separator is ', ' 
Items are: 
     '"field1"' 
     '"field2"' 
     '"field3"' 
     '"field4"' 

--- Line #7 --------------- 
Line is '"field1", "field2", "field3", "field4"' 
Separator is ', ' 
Items are: 
     '"field1"' 
     '"field2"' 
     '"field3"' 
     '"field4"' 

Unfortunately, the empty lines end up in the output as well. Trying to fix it :)
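
One likely cause: in .NET, `Regex.Match` returns a `Match` object even when nothing matched, with `Success` set to false and empty groups, so a null check lets blank lines through. A minimal sketch of one possible fix, assuming the same pattern as above (`CsvLineSplitter` and `SplitLine` are illustrative names, not taken from the original program):

```csharp
using System;
using System.Text.RegularExpressions;

public static class CsvLineSplitter
{
    private static readonly Regex LineRegex = new Regex(string.Format(
        @"^({0}(?<sep>{1}){0})+(\k<sep>{0})*$",
        @"[""']?\w+[""']?",      // csvItem
        @"\s*[,\.;-]\s*"));      // separator

    // Returns the items of a line, or an empty array when the line is blank
    // or does not match the pattern.
    public static string[] SplitLine(string line)
    {
        Match match = LineRegex.Match(line);

        // Test Success instead of comparing the Match object with null.
        if (!match.Success)
            return new string[0];

        string sep = match.Groups["sep"].Value;
        return line.Split(new[] { sep }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

With this check, a call such as `CsvLineSplitter.SplitLine("")` should return no items instead of reporting an empty separator.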


Thanks, this is a f*cking awesome approach! – Ruslan 2012-07-18 15:59:14


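However, your approach requires a predefined list of possible delimiters. I would like to have a method that returns the most probable delimiter for a given file. – Ruslan 2012-07-18 16:07:23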


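@Ruslan: Well, I think that is hard to do. You should at least know what kind of separators you are looking for, or which characters they contain. When a csv is formatted with double spaces and spaces, – BlackBear 2012-07-18 16:35:07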