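I want to split a string into a list or an array. How can I split it using a regular expression?
Input: green,"yellow,green",white,orange,"blue,black"
The split character is a comma (,), but it must ignore commas inside quotes.
The output should be:
- green
- yellow,green
- white
- orange
- blue,black
Thanks.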
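Actually, this is fairly easy to do using just Match:
// Requires: using System; using System.Text.RegularExpressions;
string subjectString = @"green,""yellow,green"",white,orange,""blue,black""";
try
{
    // Match either a run of letters and commas enclosed in quotes, or a plain run of letters.
    Regex regexObj = new Regex(@"(?<="")\b[a-z,]+\b(?="")|[a-z]+", RegexOptions.IgnoreCase);
    Match matchResults = regexObj.Match(subjectString);
    while (matchResults.Success)
    {
        Console.WriteLine("{0}", matchResults.Value);
        // matched text: matchResults.Value
        // match start: matchResults.Index
        // match length: matchResults.Length
        matchResults = matchResults.NextMatch();
    }
}
catch (ArgumentException ex)
{
    // The Regex constructor throws ArgumentException when the pattern is invalid.
    Console.WriteLine(ex.Message);
}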
Output:
green
yellow,green
white
orange
blue,black
Explanation:
@"
             # Match either the regular expression below (attempting the next alternative only if this one fails)
(?<=         # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
    ""       # Match the character '"' literally (written as "" inside a C# verbatim string)
)
\b           # Assert position at a word boundary
[a-z,]       # Match a single character present in the list below
             #   A character in the range between 'a' and 'z'
             #   The character ','
    +        # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b           # Assert position at a word boundary
(?=          # Assert that the regex below can be matched, starting at this position (positive lookahead)
    ""       # Match the character '"' literally (written as "" inside a C# verbatim string)
)
|            # Or match regular expression number 2 below (the entire match attempt fails if this one fails to match)
[a-z]        # Match a single character in the range between 'a' and 'z'
    +        # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
"
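Someone will come along soon enough with an answer that does this with a single regular expression. I'm not that clever, but just for balance, here is a suggestion that doesn't rely entirely on regular expressions, based on the old adage that when you try to solve a problem with a regular expression, you end up with two problems. :)
Personally, given my lack of regex-fu, I would do one of the following:
- Use Replace to escape any commas inside quotes as something else (i.e. ","). You can then run a simple string.Split() on the result and unescape each item in the resulting array before using it. This is yucky, partly because it handles everything twice and partly because it also uses regular expressions. Boooo!
- Parse the string by hand, one character at a time, keeping track of whether you are currently inside quotes.
If written well, the non-regex option should perform better, because regular expressions can be a bit expensive: they scan the string internally while looking for patterns.
Really, I just wanted to point out that you don't have to use a regular expression. :)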
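The first option above (escape, split, unescape) might look roughly like the sketch below. The class name EscapeThenSplit and the \u0001 placeholder are made up for illustration; the sketch assumes the placeholder never occurs in real input and that the quotes are balanced.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EscapeThenSplit
{
    // Hypothetical placeholder; assumed never to appear in the input.
    private const string Placeholder = "\u0001";

    public static string[] Split(string input)
    {
        // Step 1: for every quoted section, drop the quotes and
        // replace the commas inside it with the placeholder.
        string escaped = Regex.Replace(
            input,
            "\"([^\"]*)\"",
            m => m.Groups[1].Value.Replace(",", Placeholder));

        // Step 2: a plain split is now safe.
        // Step 3: turn each placeholder back into a comma.
        return escaped.Split(',')
                      .Select(field => field.Replace(Placeholder, ","))
                      .ToArray();
    }
}
```

Calling EscapeThenSplit.Split on the input from the question should give the five fields green, yellow,green, white, orange and blue,black. Every field is touched twice, which is the double handling complained about above.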
Here is a fairly naive implementation of my second suggestion. On my PC it happily parses 1 million 15-column strings in 4.5 seconds.
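using System.Collections.Generic;
using System.Linq;

// IParser is the author's own interface; its definition is not shown in the post.
public class ManualParser : IParser
{
    public IEnumerable<string> Parse(string line)
    {
        if (string.IsNullOrWhiteSpace(line)) return new List<string>();
        line = line.Trim();
        // Fast paths: no commas means a single column; no quotes means a plain split is safe.
        if (line.Contains(",") == false) return new[] { line.Trim('"') };
        if (line.Contains("\"") == false) return line.Split(',').Select(c => c.Trim());
        bool withinQuotes = false;
        var builder = new List<string>();
        var trimChars = new[] { ' ', '"' };
        int left = 0;  // start of the current column
        int right = 0; // current scan position
        for (right = 0; right < line.Length; right++)
        {
            char c = line[right];
            if (c == '"')
            {
                withinQuotes = !withinQuotes;
                continue;
            }
            if (c == ',' && !withinQuotes)
            {
                builder.Add(line.Substring(left, right - left).Trim(trimChars));
                // The next column starts after the comma. The loop's own right++ already steps
                // past the comma; incrementing right here as well would skip the next character
                // (for example an opening quote) and overrun the string on a trailing comma.
                left = right + 1;
            }
        }
        builder.Add(line.Substring(left, right - left).Trim(trimChars));
        return builder;
    }
}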
Here are some unit tests for it:
[TestFixture]
public class ManualParserTests
{
    [Test]
    public void Parse_GivenStringWithNoQuotesAndNoCommas_ShouldReturnThatString()
    {
        // Arrange
        var parser = new ManualParser();
        // Act
        string[] result = parser.Parse("This is my data").ToArray();
        // Assert
        Assert.AreEqual(1, result.Length, "Should only be one column returned");
        Assert.AreEqual("This is my data", result[0], "Incorrect value is returned");
    }

    [Test]
    public void Parse_GivenStringWithNoQuotesAndOneComma_ShouldReturnTwoColumns()
    {
        // Arrange
        var parser = new ManualParser();
        // Act
        string[] result = parser.Parse("This is, my data").ToArray();
        // Assert
        Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
        Assert.AreEqual("This is", result[0], "First value is incorrect");
        Assert.AreEqual("my data", result[1], "Second value is incorrect");
    }

    [Test]
    public void Parse_GivenStringWithQuotesAndNoCommas_ShouldReturnColumnWithoutQuotes()
    {
        // Arrange
        var parser = new ManualParser();
        // Act
        string[] result = parser.Parse("\"This is my data\"").ToArray();
        // Assert
        Assert.AreEqual(1, result.Length, "Should be 1 column returned");
        Assert.AreEqual("This is my data", result[0], "Value is incorrect");
    }

    [Test]
    public void Parse_GivenStringWithQuotesAndCommas_ShouldReturnColumnsWithoutQuotes()
    {
        // Arrange
        var parser = new ManualParser();
        // Act
        string[] result = parser.Parse("\"This is\", my data").ToArray();
        // Assert
        Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
        Assert.AreEqual("This is", result[0], "First value is incorrect");
        Assert.AreEqual("my data", result[1], "Second value is incorrect");
    }

    [Test]
    public void Parse_GivenStringWithQuotesContainingCommasAndCommas_ShouldReturnColumnsWithoutQuotes()
    {
        // Arrange
        var parser = new ManualParser();
        // Act
        string[] result = parser.Parse("\"This, is\", my data").ToArray();
        // Assert
        Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
        Assert.AreEqual("This, is", result[0], "First value is incorrect");
        Assert.AreEqual("my data", result[1], "Second value is incorrect");
    }
}
And here is the sample application I used to test the throughput:
class Program
{
    static void Main(string[] args)
    {
        RunTest();
    }

    private static void RunTest()
    {
        var parser = new ManualParser();
        string csv = Properties.Resources.Csv;
        var result = new StringBuilder();
        var s = new Stopwatch();
        for (int test = 0; test < 3; test++)
        {
            int lineCount = 0;
            s.Start();
            for (int i = 0; i < 1000000/50; i++)
            {
                foreach (var line in csv.Split(new[] { Environment.NewLine }, StringSplitOptions.None))
                {
                    string cur = line + s.ElapsedTicks.ToString();
                    result.AppendLine(parser.Parse(cur).ToString());
                    lineCount++;
                }
            }
            s.Stop();
            Console.WriteLine("Completed {0} lines in {1}ms", lineCount, s.ElapsedMilliseconds);
            s.Reset();
            result = new StringBuilder();
        }
    }
}
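The format of the string you are trying to split looks like standard CSV. Using a CSV parser would probably be easier and faster.
What you have there is an irregular language. In other words, the meaning of a character depends on the sequence of characters before or after it. As the name implies, regular expressions are meant for parsing regular languages.
What you need is a Tokenizer and a Parser; a good internet search engine should lead you to examples. In fact, since the tokens are just characters, you probably don't even need the Tokenizer.
While you can handle this simple case with a regular expression, it will probably be slow. It can also cause problems if the quotes are not balanced, because a regular expression will not detect that error, whereas a parser could.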
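For comparison, a regex-only split for this simple case could look like the sketch below. The QuoteAwareSplitter name is made up for illustration. The pattern splits on a comma only when an even number of quote characters follows it, so it assumes balanced quotes and does not report an error when they are unbalanced. The lookahead also rescans the rest of the string at every comma, which is one reason this approach may be slow on long lines.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuoteAwareSplitter
{
    // A comma followed by an even number of quote characters up to the
    // end of the string is treated as being outside any quoted section.
    private static readonly Regex Separator =
        new Regex(@",(?=(?:[^""]*""[^""]*"")*[^""]*$)");

    public static string[] Split(string input)
    {
        return Separator.Split(input)
                        .Select(field => field.Trim('"')) // drop the enclosing quotes
                        .ToArray();
    }
}
```

For an input such as a,"b,c (one unmatched quote) the method returns fields without raising any error, which illustrates the concern about unbalanced quotes.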
If you are importing a CSV file, you may want to look at the Microsoft.VisualBasic.FileIO.TextFieldParser class, which parses CSV files (just add a reference to Microsoft.VisualBasic.dll in your C# project).
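A minimal sketch of using TextFieldParser on a single line, assuming the Microsoft.VisualBasic assembly is referenced. The CsvLineReader wrapper and its ReadFields method name are made up for illustration; the parser members used are TextFieldType, SetDelimiters, HasFieldsEnclosedInQuotes and ReadFields.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvLineReader
{
    public static string[] ReadFields(string line)
    {
        // TextFieldParser accepts any TextReader, so a StringReader lets it
        // parse an in-memory string instead of a file.
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            // ReadFields returns null when there is nothing left to read.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

For a real file, the same settings apply to a parser constructed from the file path, calling ReadFields in a loop until EndOfData is true.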
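Another way to do this is to write your own state machine (example below), although this still does not solve the problem of a quote in the middle of a value: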
using System;
using System.Text;

namespace Example
{
    class Program
    {
        static void Main(string[] args)
        {
            string subjectString = @"green,""yellow,green"",white,orange,""blue,black""";
            bool inQuote = false;
            StringBuilder currentResult = new StringBuilder();
            foreach (char c in subjectString)
            {
                switch (c)
                {
                    case '\"':
                        inQuote = !inQuote;
                        break;
                    case ',':
                        if (inQuote)
                        {
                            currentResult.Append(c);
                        }
                        else
                        {
                            Console.WriteLine(currentResult);
                            currentResult.Clear();
                        }
                        break;
                    default:
                        currentResult.Append(c);
                        break;
                }
            }
            if (inQuote)
            {
                throw new FormatException("Input string does not have balanced Quote Characters");
            }
            Console.WriteLine(currentResult);
        }
    }
}
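One possible way to extend the state machine to quotes inside a value is sketched below. It assumes the common CSV convention that a doubled quote ("") inside a quoted field stands for one literal quote; the post does not say which convention the data uses, so treat this as an assumption. The QuotedFieldSplitter name is also made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuotedFieldSplitter
{
    public static List<string> Split(string input)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuote = false;

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '"')
            {
                // Assumed convention: "" inside a quoted field is one literal quote.
                if (inQuote && i + 1 < input.Length && input[i + 1] == '"')
                {
                    current.Append('"');
                    i++; // skip the second quote of the pair
                }
                else
                {
                    inQuote = !inQuote;
                }
            }
            else if (c == ',' && !inQuote)
            {
                // An unquoted comma ends the current field.
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        if (inQuote)
        {
            throw new FormatException("Input string does not have balanced quote characters");
        }
        fields.Add(current.ToString());
        return fields;
    }
}
```

Unlike the version above, this one collects the fields in a list instead of printing them, which makes it easier to call from other code.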
Do you need to use a regex? – rownage
hui, please feel free to a) comment, b) upvote and c) accept the answer that helped you most. :) –