2011-12-29 131 views
3

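Split a string by spaces in C#

I want to split a string on spaces, except where the text is enclosed in double quotes ("text") or single quotes ('text').

I am currently doing it with this function: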

public static string[] ParseKeywordExpression(string keywordExpressionValue, bool isUniqueKeywordReq) 
{ 
    // Check for null before calling Trim() to avoid a NullReferenceException.
    if (string.IsNullOrWhiteSpace(keywordExpressionValue)) 
     return new string[0]; 
    keywordExpressionValue = keywordExpressionValue.Trim(); 
    int idx = keywordExpressionValue.Trim().IndexOf(" "); 
    if (idx == -1) 
     return new string[] { keywordExpressionValue }; 
    //idx = idx + 1; 
    int count = keywordExpressionValue.Length; 
    ArrayList extractedList = new ArrayList(); 
    while (count > 0) 
    { 
     if (keywordExpressionValue[0] == '"') 
     { 
      int temp = keywordExpressionValue.IndexOf(BACKSLASH, 1, keywordExpressionValue.Length - 1); 
      while (keywordExpressionValue[temp - 1] == '\\') 
      { 
       temp = keywordExpressionValue.IndexOf(BACKSLASH, temp + 1, keywordExpressionValue.Length - temp - 1); 
      } 
      idx = temp + 1; 
     } 
     if (keywordExpressionValue[0] == '\'') 
     { 
      int temp = keywordExpressionValue.IndexOf(BACKSHASH_QUOTE, 1, keywordExpressionValue.Length - 1); 
      while (keywordExpressionValue[temp - 1] == '\\') 
      { 
       temp = keywordExpressionValue.IndexOf(BACKSHASH_QUOTE, temp + 1, keywordExpressionValue.Length - temp - 1); 
      } 
      idx = temp + 1; 
     } 
     string s = keywordExpressionValue.Substring(0, idx); 
     int left = count - idx; 
     keywordExpressionValue = keywordExpressionValue.Substring(idx, left).Trim(); 
     if (isUniqueKeywordReq)      
     { 
      if (!extractedList.Contains(s.Trim('"'))) 
      { 
       extractedList.Add(s.Trim('"')); 
      } 
     } 
     else 
     { 
      extractedList.Add(s.Trim('"')); 
     } 
     count = keywordExpressionValue.Length; 
     idx = keywordExpressionValue.IndexOf(SPACE); 
     if (idx == -1) 
     { 
      string add = keywordExpressionValue.Trim('"', ' '); 
      if (add.Length > 0) 
      { 
       if (isUniqueKeywordReq) 
       { 
        if (!extractedList.Contains(add)) 
        { 
         extractedList.Add(add); 
        } 
       } 
       else 
       { 
        extractedList.Add(add); 
       } 
      }     
      break; 
     } 
    } 
    return (string[])extractedList.ToArray(typeof(string)); 
} 

Is there any other way to do this, or can this function be optimized?

For example, I want to split the string

%ABC% %aasdf% aalasdjjfas "c:\Document and Setting\Program Files\abc.exe"

into

%ABC%
%aasdf%
aalasdjjfas
"c:\Document and Setting\Program Files\abc.exe"

+0

So find a CSV regex and adapt it to use '\s' instead of commas? – 2011-12-29 13:37:25

+0

@BradChristie I have edited my question to show the output I want. I don't think a CSV regex will help. – Ankesh 2011-12-29 13:47:19

Answers

6

The simplest way to do this is a regular expression that handles both single and double quotes:

("((\\")|([^"]))*")|('((\\')|([^']))*')|(\S+)

var regex = new Regex(@"(""((\\"")|([^""]))*"")|('((\\')|([^']))*')|(\S+)"); 
var matches = regex.Matches(inputstring); 
foreach (Match match in matches) { 
    extractedList.Add(match.Value); 
} 

So basically four or five lines of code are enough.

The expression, explained:

Main structure: 
("((\\")|([^"]))*") Double-quoted token 
|      , or 
('((\\')|([^']))*') single-quoted token 
|      , or 
(\S+)     any group of non-space characters 

Double-quoted token: 
(      Group starts 
    "     Initial double-quote 
    (     Inner group starts 
     (\\")   Either a backslash followed by a double-quote 
     |    , or 
     ([^"])   any non-double-quote character 
    )*     The inner group repeats any number of times (or zero) 
    "     Ending double-quote 
) 

Single-quoted token: 
(      Group starts 
    '     Initial single-quote 
    (     Inner group starts 
     (\\')   Either a backslash followed by a single-quote 
     |    , or 
     ([^'])   any non-single-quote character 
    )*     The inner group repeats any number of times (or zero) 
    '     Ending single-quote 
) 

Non-space characters: 
(      Group starts 
    \S     Non-white-space character 
    +     , repeated at least once 
)      Group ends 
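As a rough sketch of how the pattern above might be wrapped for reuse: the class name `TokenizerSketch` and the helpers `Tokenize` and `Unquote` are hypothetical names introduced here for illustration, not part of the original answer. `Unquote` is included only because the function in the question strips the surrounding quotes from each keyword.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenizerSketch
{
    // Same pattern as in the answer, written as a C# verbatim string
    // (a doubled "" inside @"..." stands for one literal double quote).
    private static readonly Regex TokenRegex =
        new Regex(@"(""((\\"")|([^""]))*"")|('((\\')|([^']))*')|(\S+)");

    // Hypothetical helper: returns every token in order,
    // keeping the surrounding quotes on quoted tokens.
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        foreach (Match match in TokenRegex.Matches(input))
        {
            tokens.Add(match.Value);
        }
        return tokens;
    }

    // Hypothetical helper: removes one pair of matching outer quotes, if present.
    public static string Unquote(string token)
    {
        if (token.Length >= 2)
        {
            char first = token[0];
            char last = token[token.Length - 1];
            if ((first == '"' || first == '\'') && first == last)
            {
                return token.Substring(1, token.Length - 2);
            }
        }
        return token;
    }
}
```

Calling `TokenizerSketch.Tokenize(inputstring)` on the example string from the question should give the three unquoted tokens followed by the quoted path as a single token; pass each token through `Unquote` if the quotes are not wanted in the result.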
+0

It works for double quotes but not for single quotes, e.g. %ABC% %aasdf% aalasdjjfas "c:\Document and Setting\Program Files\abc.exe" 'c:\Document and Setting\Program Files\abc.exe' – Ankesh 2011-12-29 14:01:26

+0

Updated my answer to also cover single quotes. – 2011-12-29 14:31:13

+0

Your regex works great... :). Thanks :) – Ankesh 2011-12-30 06:18:29

2

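If you don't like regular expressions, this method should be able to split the string while keeping quoted sections together and ignoring consecutive spaces: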

public IEnumerable<string> SplitString(string input) 
{ 
    var isInDoubleQuote = false; 
    var isInSingleQuote = false; 
    var sb = new StringBuilder(); 
    foreach (var c in input) 
    { 
     if (!isInDoubleQuote && c == '"') 
     { 
      isInDoubleQuote = true; 
      sb.Append(c); 
     } 
     else if (isInDoubleQuote) 
     { 
      sb.Append(c); 
      if (c != '"') 
       continue; 
      if (sb.Length > 2) 
       yield return sb.ToString(); 
      sb = sb.Clear(); 
      isInDoubleQuote = false; 
     } 
     else if (!isInSingleQuote && c == '\'') 
     { 
      isInSingleQuote = true; 
      sb.Append(c); 
     } 
     else if (isInSingleQuote) 
     { 
      sb.Append(c); 
      if (c != '\'') 
       continue; 
      if (sb.Length > 2) 
       yield return sb.ToString(); 
      sb = sb.Clear(); 
      isInSingleQuote = false; 
     } 
     else if (c == ' ') 
     { 
      if (sb.Length == 0) 
       continue; 
      yield return sb.ToString(); 
      sb.Clear(); 
     } 
     else 
      sb.Append(c); 
    } 
    if (sb.Length > 0) 
     yield return sb.ToString(); 
} 

Edit: changed the return type to IEnumerable, using yield and StringBuilder.
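The two boolean flags in the method above could arguably be folded into a single "current quote character" variable. The following is only a sketch of that idea, not the answer's code: `QuoteAwareSplitter` and `Split` are hypothetical names, and the behaviour differs slightly in that it splits on any whitespace (not just ' ') and also returns empty quoted strings such as "".

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuoteAwareSplitter
{
    // Hypothetical variant: tracks which quote character (if any) is open.
    public static IEnumerable<string> Split(string input)
    {
        var sb = new StringBuilder();
        char openQuote = '\0'; // '\0' means "not inside a quoted section"

        foreach (var c in input)
        {
            if (openQuote != '\0')
            {
                // Inside quotes: keep everything, including spaces.
                sb.Append(c);
                if (c == openQuote)
                {
                    // Closing quote found: emit the token with its quotes.
                    yield return sb.ToString();
                    sb.Clear();
                    openQuote = '\0';
                }
            }
            else if (c == '"' || c == '\'')
            {
                // Opening quote: remember which kind it was.
                openQuote = c;
                sb.Append(c);
            }
            else if (char.IsWhiteSpace(c))
            {
                // Whitespace outside quotes ends the current token, if any.
                if (sb.Length > 0)
                {
                    yield return sb.ToString();
                    sb.Clear();
                }
            }
            else
            {
                sb.Append(c);
            }
        }

        // Emit whatever is left, including an unterminated quoted section.
        if (sb.Length > 0)
            yield return sb.ToString();
    }
}
```

Because the method uses `yield return`, nothing is parsed until the result is enumerated, for example with `foreach (var token in QuoteAwareSplitter.Split(line))`.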

+0

This will generate a lot of GC'able temporary strings, won't it? – 2011-12-29 17:02:43

+1

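If you aren't going to hit the results more than once, but just 'foreach' through them, then changing the return type to 'IEnumerable<string>' and replacing the 'output.Add' calls with 'yield return curentString;' is a good idea. This is also a case for using 'StringBuilder' rather than lots of concatenation. – 2011-12-29 17:06:19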

+0

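I totally agree with @JonHanna. 'yield return' is an underused feature of C#. The 'StringBuilder' argument is valid, but since this would probably only be used to parse a sequence of command-line arguments, the performance hit is not that big. Still, there is no excuse for sloppy code. – 2011-12-29 18:17:44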

2

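I escape the single and double quotes in the pattern by using their hexadecimal values, \x27 and \x22. This makes the C# literal text of the pattern easier to read and work with.

Using IgnorePatternWhitespace also allows the pattern to be commented for better readability; it does not affect how the regex is processed.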

string data = @"'single' %ABC% %aasdf% aalasdjjfas ""c:\Document and Setting\Program Files\abc.exe"""; 

string pattern = @"(?xm)  # Tell the regex compiler we are commenting (x = IgnorePatternWhitespace) 
          # and tell the compiler this is multiline (m), 
          # In Multiline the ^ matches each start of line and $ is each EOL 
          # -Pattern Start- 
^(       # Start at the beginning of the line always 
(?![\r\n]|$)    # Stop the match if EOL or EOF found. 
(?([\x27\x22])    # Regex If to check for single/double quotes 
     (?:[\x27\x22])   # \\x27\\x22 are single/double quotes 
     (?<Token>[^\x27\x22]+) # Match this in the quotes and place in Named match Token 
     (?:[\x27\x22]) 

    |       # or (else) part of If when Not within quotes 

    (?<Token>[^\s\r\n]+) # Not within quotes, but put it in the Token match group 
)       # End of Pattern OR 

(?:\s?)      # Either a space or EOL/EOF 
)+       # 1 or more tokens of data. 
"; 

Console.WriteLine(string.Join(" | ", 

Regex.Match(data, pattern) 
     .Groups["Token"] 
     .Captures 
     .OfType<Capture>() 
     .Select(cp => cp.Value) 
       ) 
       ); 
/* Output 
single | %ABC% | %aasdf% | aalasdjjfas | c:\Document and Setting\Program Files\abc.exe 
*/ 
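The same two ideas (hex escapes for the quote characters and a commented pattern) can also be applied to a simpler pattern that is matched repeatedly with Regex.Matches instead of relying on the capture collection of a single match. This is a minimal sketch under that assumption; `CommentedPatternSketch` and `Tokenize` are hypothetical names, and unlike the pattern above it passes RegexOptions.IgnorePatternWhitespace explicitly rather than using the inline (?x) option.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CommentedPatternSketch
{
    // \x22 is the double quote and \x27 is the single quote.
    // With IgnorePatternWhitespace, unescaped whitespace in the pattern is
    // ignored and everything after # on a line is treated as a comment.
    private const string Pattern = @"
        \x22 (?<Token> [^\x22]* ) \x22   # double-quoted token, quotes excluded
      | \x27 (?<Token> [^\x27]* ) \x27   # single-quoted token, quotes excluded
      | (?<Token> \S+ )                  # any run of non-whitespace characters
    ";

    // Hypothetical helper: returns the Token group of every match, in order.
    public static string[] Tokenize(string input)
    {
        return Regex.Matches(input, Pattern, RegexOptions.IgnorePatternWhitespace)
                    .Cast<Match>()
                    .Select(m => m.Groups["Token"].Value)
                    .ToArray();
    }
}
```

.NET allows the same group name to appear in several alternatives, so whichever branch matches fills the Token group. This sketch does not handle backslash-escaped quotes inside a quoted token.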

The above is based on two blog posts I wrote.

+1

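I'm glad you found your answer. I put a lot of trust in regular expressions; if people take the time to learn them, they are a powerful tool that can be used across the board, regardless of language (C#/Java/PHP). :-) – OmegaMan 2011-12-30 08:34:50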