2011-11-14 49 views
5

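I want to split a string into a list or an array. How do I split the string using a regular expression?

Input: green,"yellow,green",white,orange,"blue,black"

The delimiter is a comma (,), but commas inside quotes must be ignored.

The output should be:

  • green
  • yellow,green
  • white
  • orange
  • blue,black

Thanks.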

+0

Do you need to use a regex? – rownage

+0

hui, feel free to a) comment, b) upvote, and c) accept the answer that helped you most. :) –

Answers

11

Actually, this is quite easy to do using Match alone:

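// Requires: using System; using System.Text.RegularExpressions;
string subjectString = @"green,""yellow,green"",white,orange,""blue,black""";
try
{
    // Either a quoted run of letters and commas, or a plain run of letters.
    Regex regexObj = new Regex(@"(?<="")\b[a-z,]+\b(?="")|[a-z]+", RegexOptions.IgnoreCase);
    Match matchResults = regexObj.Match(subjectString);
    while (matchResults.Success)
    {
        Console.WriteLine("{0}", matchResults.Value);
        // matched text: matchResults.Value
        // match start: matchResults.Index
        // match length: matchResults.Length
        matchResults = matchResults.NextMatch();
    }
}
catch (ArgumentException ex)
{
    // The original snippet was cut off before its catch block; this one is an assumption.
    // The Regex constructor throws ArgumentException when the pattern has a syntax error.
    Console.WriteLine(ex.Message);
}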

Output:

green 
yellow,green 
white 
orange 
blue,black 

Explanation:

@"
      # Match either the regular expression below (attempting the next alternative only if this one fails)
    (?<=   # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
     ""   # Match the double-quote character literally
    )
    \b   # Assert position at a word boundary
    [a-z,]  # Match a single character present in the list below
        # A character in the range between 'a' and 'z'
        # The character ','
     +   # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \b   # Assert position at a word boundary
    (?=   # Assert that the regex below can be matched, starting at this position (positive lookahead)
     ""   # Match the double-quote character literally
    )
|   # Or match regular expression number 2 below (the entire match attempt fails if this one fails to match)
    [a-z]  # Match a single character in the range between 'a' and 'z'
     +   # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
"
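
Since the question asks for a list or an array rather than console output, the same pattern can be wrapped in a small helper. This is only a sketch: the names QuotedSplitter and SplitFields are made up for illustration, and the pattern still recognises nothing but the letters a-z (no digits, spaces, or empty fields).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the original answer).
public static class QuotedSplitter
{
    // Same pattern as in the answer above: a quoted run of letters/commas, or a plain run of letters.
    private static readonly Regex FieldRegex =
        new Regex(@"(?<="")\b[a-z,]+\b(?="")|[a-z]+", RegexOptions.IgnoreCase);

    public static List<string> SplitFields(string input)
    {
        var fields = new List<string>();
        // Regex.Matches returns every non-overlapping match, in order.
        foreach (Match m in FieldRegex.Matches(input))
        {
            fields.Add(m.Value);
        }
        return fields;
    }
}
```

Calling QuotedSplitter.SplitFields on the input from the question should yield the five items listed under "Output" above; call ToArray() on the result if an array is needed.
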
+0

@downvoter, would you care to explain the reason behind your vote? Or will you keep hiding behind your anonymity? – FailedDev

+1

LOL, my first accepted answer that has also been downvoted. Is there a badge for that? :D – FailedDev

+0

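I don't think a regex is the way to solve this problem, because all the back-and-forth tracking is slow. But I'll give you an upvote for solving this simple example. –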

2

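Someone will soon come up with an answer that does this with a single regex. I'm not that clever, but just for balance, here's a suggestion that doesn't use a regex at all. It's based on the old adage that when you try to solve a problem with a regex, you end up with two problems. :)

Personally, given my lack of regex-fu, I would do one of the following:

  • Use a simple regex-based Replace to escape any commas inside quotes with something else (e.g. "&comma;"). You can then run a plain string.Split() on the result and unescape each item in the resulting array before using it. This is yucky, partly because it handles everything twice and partly because it also uses a regex. Boooo!
  • Parse it by hand, char by char. Convert the string to a char array, then iterate over it, keeping track of whether you are "within quotes", and build up the result array one item at a time.
  • Same as the previous suggestion, but use a CSV parser written by someone on the internet. The example I created below doesn't pass all the tests in the CSV spec, so it's really only a guide to illustrate my point.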
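
The first suggestion in the list above (escape, split, unescape) might look roughly like the sketch below. The class name EscapeSplit and the method name Split are invented for illustration, and the sketch assumes the data never contains the text "&comma;" itself and never contains a quote inside a quoted value.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the escape/split/unescape idea.
public static class EscapeSplit
{
    private const string Placeholder = "&comma;";

    public static string[] Split(string line)
    {
        // Step 1: for each quoted section, drop the quotes and
        // replace the commas inside it with the placeholder.
        string escaped = Regex.Replace(
            line,
            "\"([^\"]*)\"",
            m => m.Groups[1].Value.Replace(",", Placeholder));

        // Step 2: the remaining commas are real delimiters, so a plain Split works.
        // Step 3: turn the placeholder back into a comma in every item.
        return escaped.Split(',')
                      .Select(item => item.Replace(Placeholder, ","))
                      .ToArray();
    }
}
```

Every character is touched at least twice here (once by the Replace, once by the Split and unescape), which is the double handling the bullet complains about.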

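If written well, the non-regex options will perform better, because regexes can be a little expensive as they scan the string internally looking for patterns.

Really, I just wanted to point out that you don't have to use a regex. :)

Here is a fairly naive implementation of my second suggestion. On my PC it quite happily parses 1 million 15-column strings in 4.5 seconds.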

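using System.Collections.Generic;
using System.Linq;

// IParser is not shown in the original answer; this minimal definition is an assumption.
public interface IParser
{
    IEnumerable<string> Parse(string line);
}

public class ManualParser : IParser
{
    public IEnumerable<string> Parse(string line)
    {
        if (string.IsNullOrWhiteSpace(line)) return new List<string>();

        line = line.Trim();

        // Fast paths: a single column, or several columns with no quotes at all.
        if (line.Contains(",") == false) return new[] { line.Trim('"') };

        if (line.Contains("\"") == false) return line.Split(',').Select(c => c.Trim());

        bool withinQuotes = false;
        var builder = new List<string>();
        var trimChars = new[] { ' ', '"' };

        int left = 0;   // start of the current column
        int right = 0;  // current scan position

        for (right = 0; right < line.Length; right++)
        {
            char c = line[right];

            if (c == '"')
            {
                withinQuotes = !withinQuotes;
                continue;
            }

            if (c == ',' && !withinQuotes)
            {
                builder.Add(line.Substring(left, right - left).Trim(trimChars));
                // The next column starts right after the comma. The original also did right++ here,
                // which skipped the next character (such as an opening quote) once the loop advanced.
                left = right + 1;
            }
        }

        // Whatever remains after the last delimiter is the final column.
        builder.Add(line.Substring(left, right - left).Trim(trimChars));

        return builder;
    }
}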

Here are some unit tests for it:

[TestFixture] 
public class ManualParserTests 
{ 
    [Test] 
    public void Parse_GivenStringWithNoQuotesAndNoCommas_ShouldReturnThatString() 
    { 
     // Arrange 
     var parser = new ManualParser(); 

     // Act 
     string[] result = parser.Parse("This is my data").ToArray(); 

     // Assert 
     Assert.AreEqual(1, result.Length, "Should only be one column returned"); 
     Assert.AreEqual("This is my data", result[0], "Incorrect value is returned"); 
    } 

    [Test] 
    public void Parse_GivenStringWithNoQuotesAndOneComma_ShouldReturnTwoColumns() 
    { 
     // Arrange 
     var parser = new ManualParser(); 

     // Act 
     string[] result = parser.Parse("This is, my data").ToArray(); 

     // Assert 
     Assert.AreEqual(2, result.Length, "Should be 2 columns returned"); 
     Assert.AreEqual("This is", result[0], "First value is incorrect"); 
     Assert.AreEqual("my data", result[1], "Second value is incorrect"); 
    } 

    [Test] 
    public void Parse_GivenStringWithQuotesAndNoCommas_ShouldReturnColumnWithoutQuotes() 
    { 
     // Arrange 
     var parser = new ManualParser(); 

     // Act 
     string[] result = parser.Parse("\"This is my data\"").ToArray(); 

     // Assert 
     Assert.AreEqual(1, result.Length, "Should be 1 column returned"); 
     Assert.AreEqual("This is my data", result[0], "Value is incorrect"); 
    } 

    [Test] 
    public void Parse_GivenStringWithQuotesAndCommas_ShouldReturnColumnsWithoutQuotes() 
    { 
     // Arrange 
     var parser = new ManualParser(); 

     // Act 
     string[] result = parser.Parse("\"This is\", my data").ToArray(); 

     // Assert 
     Assert.AreEqual(2, result.Length, "Should be 2 columns returned"); 
     Assert.AreEqual("This is", result[0], "First value is incorrect"); 
     Assert.AreEqual("my data", result[1], "Second value is incorrect"); 
    } 

    [Test] 
    public void Parse_GivenStringWithQuotesContainingCommasAndCommas_ShouldReturnColumnsWithoutQuotes() 
    { 
     // Arrange 
     var parser = new ManualParser(); 

     // Act 
     string[] result = parser.Parse("\"This, is\", my data").ToArray(); 

     // Assert 
     Assert.AreEqual(2, result.Length, "Should be 2 columns returned"); 
     Assert.AreEqual("This, is", result[0], "First value is incorrect"); 
     Assert.AreEqual("my data", result[1], "Second value is incorrect"); 
    } 
} 

And here is the sample application I used to test the throughput:

class Program 
{ 
    static void Main(string[] args) 
    { 
     RunTest(); 
    } 

    private static void RunTest() 
    { 
     var parser = new ManualParser(); 
     string csv = Properties.Resources.Csv; 
     var result = new StringBuilder(); 
     var s = new Stopwatch(); 

     for (int test = 0; test < 3; test++) 
     { 
      int lineCount = 0; 

      s.Start(); 
      for (int i = 0; i < 1000000/50; i++) 
      { 
       foreach (var line in csv.Split(new[] { Environment.NewLine }, StringSplitOptions.None)) 
       { 
        string cur = line + s.ElapsedTicks.ToString(); 
        result.AppendLine(parser.Parse(cur).ToString()); 
        lineCount++; 
       } 
      } 
      s.Stop(); 
      Console.WriteLine("Completed {0} lines in {1}ms", lineCount, s.ElapsedMilliseconds); 
      s.Reset(); 
      result = new StringBuilder(); 
     } 
    } 
} 
2

The format of the string you are trying to split looks like standard CSV. Using a CSV parser would probably be easier and faster.

5

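What you have there is a non-regular language. In other words, the meaning of a character depends on the sequence of characters before or after it. As the name implies, regular expressions are for parsing regular languages.

What you need is a Tokenizer and a Parser; a good internet search engine should lead you to examples. In fact, since the tokens are just characters, you probably don't even need the Tokenizer.

While you can handle this simple example with a regex, it will probably be slow. It can also cause problems if the quotes are not balanced, because a regex won't detect that error, whereas a parser could.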

If you are importing a CSV file, you may want to look at the Microsoft.VisualBasic.FileIO.TextFieldParser class, which parses CSV files (just add a reference to Microsoft.VisualBasic.dll in your C# project).
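
As a rough sketch of how TextFieldParser could be applied to a single line held in a string rather than a file: the wrapper class and method names below are made up for illustration, and the project is assumed to reference Microsoft.VisualBasic as described above.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

// Hypothetical wrapper around TextFieldParser for a single line of input.
public static class TextFieldParserExample
{
    public static string[] ParseLine(string line)
    {
        // TextFieldParser accepts any TextReader, so a StringReader works for in-memory text.
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            // ReadFields returns null when there is no more data, so fall back to an empty array.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

For a real CSV file you would construct the parser from a file path or stream instead and call ReadFields() in a loop until EndOfData is true.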

Another way to do this is to write your own state machine (example below), although this still doesn't handle a quote in the middle of a value:

using System; 
using System.Text; 

namespace Example 
{ 
    class Program 
    { 
     static void Main(string[] args) 
     { 
      string subjectString = @"green,""yellow,green"",white,orange,""blue,black"""; 

      bool inQuote = false; 
      StringBuilder currentResult = new StringBuilder(); 
      foreach (char c in subjectString) 
      { 
       switch (c) 
       { 
        case '\"': 
         inQuote = !inQuote; 
         break; 

        case ',': 
         if (inQuote) 
         { 
          currentResult.Append(c); 
         } 
         else 
         { 
          Console.WriteLine(currentResult); 
          currentResult.Clear(); 
         } 
         break; 

        default: 
         currentResult.Append(c); 
         break; 
       } 
      } 
      if (inQuote) 
      { 
       throw new FormatException("Input string does not have balanced Quote Characters"); 
      } 
      Console.WriteLine(currentResult); 
     } 
    } 
} 
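
One possible way to extend the state machine so that it copes with a quote in the middle of a value is sketched below. It assumes the common CSV convention that a literal quote inside a quoted value is written as two quotes (""), which the original answer does not specify. The class name StateMachineSplitter is invented for illustration, and the method returns a list instead of writing to the console.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical variant of the state machine above.
public static class StateMachineSplitter
{
    public static List<string> Split(string input)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuote = false;

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            if (c == '"')
            {
                // Assumption: "" inside a quoted value stands for one literal quote.
                if (inQuote && i + 1 < input.Length && input[i + 1] == '"')
                {
                    current.Append('"');
                    i++; // skip the second quote of the pair
                }
                else
                {
                    inQuote = !inQuote;
                }
            }
            else if (c == ',' && !inQuote)
            {
                // An unquoted comma ends the current field.
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        if (inQuote)
        {
            throw new FormatException("Input string does not have balanced quote characters");
        }

        fields.Add(current.ToString());
        return fields;
    }
}
```

As in the original, an unbalanced quote is reported with a FormatException rather than silently producing wrong fields.
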