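Minimize LINQ string token counter

A follow-up on the answer to an earlier question.

Is there any way to reduce this further, avoiding the external String.Split call? The goal is an associative container of {token, count}.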

string src = "for each character in the string, take the rest of the " + 
    "string starting from that character " + 
    "as a substring; count it if it starts with the target string"; 

string[] target = src.Split(new char[] { ' ' }); 

var results = target.GroupBy(t => new 
{ 
    str = t, 
    count = target.Count(sub => sub.Equals(t)) 
});


2010-10-28 Steve Townsend

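Why don't you want to use 'string.Split' to tokenize? – 2010-10-28 00:50:12

@Kirk - It's not that I want to avoid Split; I'm just looking for a more elegant and efficient formulation, if one exists. – 2010-10-28 01:54:29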

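The way you're doing it now will work (to some extent), but it is terribly inefficient. The result is also an enumeration of groupings, not the (word, count) pairs you probably have in mind.

That overload of GroupBy() takes a function that selects the key, so you are effectively performing the count calculation for every item in the collection. Without going the route of using regular expressions to ignore punctuation, it should be written like this: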

string src = "for each character in the string, take the rest of the " + 
      "string starting from that character " + 
      "as a substring; count it if it starts with the target string"; 

var results = src.Split()    // default split by whitespace 
       .GroupBy(str => str) // group words by the value 
       .Select(g => new 
           { 
            str = g.Key,  // the value 
            count = g.Count() // the count of that value 
           }); 

// sort the results by the words that were counted 
var sortedResults = results.OrderByDescending(p => p.str);
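The difference between the two result shapes is easier to see side by side. The following is a minimal sketch rather than part of the original answer: it assumes C# 9 or later (for top-level statements), and the shortened input string and the names groupings and pairs are made up for illustration.

```csharp
using System;
using System.Linq;

string src = "the cat and the dog";   // illustrative input, not the original string
string[] target = src.Split(' ');

// Question's form: the key selector rescans target for every element,
// and each result is a grouping whose key is the { str, count } object.
var groupings = target.GroupBy(t => new
{
    str = t,
    count = target.Count(sub => sub.Equals(t))
});

// Answer's form: group once by the word, then project one pair per group.
var pairs = target.GroupBy(s => s)
                  .Select(g => new { str = g.Key, count = g.Count() });

foreach (var g in groupings)
    Console.WriteLine($"{g.Key.str}: {g.Key.count} (grouping holds {g.Count()} elements)");

foreach (var p in pairs)
    Console.WriteLine($"{p.str}: {p.count}");
```

Both queries report the same counts, but the first reaches the count through g.Key and still carries the grouped elements along with it.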


2010-10-28 01:33:22

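Thanks, this is the one that best answers the question. And to clarify - I'm new to LINQ, so the extra information is helpful. – 2010-10-28 01:51:40

How would I add OrderByDescending to this? – SharpAffair 2010-11-03 13:24:25

@Sphynx: Just add the call to 'OrderByDescending()' at the point where you want the sorting to happen. Just be aware of which items you're sorting at that point in the query. For example, if it is placed before 'GroupBy()', you are sorting the strings. If after, you are sorting the groupings of strings. If after 'Select()', you are sorting the anonymous-type objects. I'll update the answer to incorporate your request. – 2010-11-03 19:41:03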
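The three placements described in the comment above can be sketched as follows. This is an illustration under assumptions, not code from the thread: it uses C# 9 top-level statements, and the input and the names sortedFirst, sortedGroups and sortedPairs are hypothetical.

```csharp
using System;
using System.Linq;

string[] words = "b a c a b a".Split(' ');   // illustrative input

// Before GroupBy: sorts the raw strings, so the groups come out in descending word order.
var sortedFirst = words.OrderByDescending(w => w)
                       .GroupBy(w => w);

// After GroupBy: sorts the groupings themselves, here by how many elements each holds.
var sortedGroups = words.GroupBy(w => w)
                        .OrderByDescending(g => g.Count());

// After Select: sorts the projected anonymous objects, here by their count property.
var sortedPairs = words.GroupBy(w => w)
                       .Select(g => new { str = g.Key, count = g.Count() })
                       .OrderByDescending(p => p.count);

Console.WriteLine(string.Join(" ", sortedFirst.Select(g => g.Key)));
Console.WriteLine(string.Join(" ", sortedGroups.Select(g => g.Key)));
Console.WriteLine(string.Join(" ", sortedPairs.Select(p => $"{p.str}={p.count}")));
```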

Although it is 3-4 times slower, the regex approach is arguably more accurate:

string src = "for each character in the string, take the rest of the " + 
    "string starting from that character " + 
    "as a substring; count it if it starts with the target string"; 

var regex=new Regex(@"\w+",RegexOptions.Compiled); 
var sw=new Stopwatch(); 

for (int i = 0; i < 100000; i++) 
{ 
    var dic=regex 
     .Matches(src) 
     .Cast<Match>() 
     .Select(m=>m.Value) 
     .GroupBy(s=>s) 
     .ToDictionary(g=>g.Key,g=>g.Count()); 
    if (i == 1000) sw.Start(); // start timing only after a warm-up of about 1000 iterations
} 
Console.WriteLine(sw.Elapsed); 

sw.Reset(); // stop the stopwatch and zero it before timing the Split version

for (int i = 0; i < 100000; i++) 
{ 
    var dic=src 
     .Split(' ') 
     .GroupBy(s=>s) 
     .ToDictionary(g=>g.Key,g=>g.Count()); 
    if (i == 1000) sw.Start(); // start timing only after a warm-up of about 1000 iterations
} 
Console.WriteLine(sw.Elapsed);

For example, the regex approach won't count string and string, as two separate entries, and will correctly tokenise substring rather than substring;.
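To make the tokenisation difference concrete, here is a small sketch that runs both approaches on a shortened fragment of the sample sentence. It assumes C# 9 top-level statements; the fragment and the names bySplit and byRegex are chosen for illustration only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Shortened fragment of the sample text, kept small so the keys are easy to inspect.
string src = "in the string, take the rest of the string as a substring; count it";

// Splitting on spaces keeps punctuation attached to the neighbouring word.
var bySplit = src.Split(' ')
                 .GroupBy(s => s)
                 .ToDictionary(g => g.Key, g => g.Count());

// Matching \w+ keeps only runs of word characters, so punctuation is dropped.
var byRegex = Regex.Matches(src, @"\w+")
                   .Cast<Match>()
                   .Select(m => m.Value)
                   .GroupBy(s => s)
                   .ToDictionary(g => g.Key, g => g.Count());

Console.WriteLine(bySplit.ContainsKey("string,"));    // Split has a separate "string," key
Console.WriteLine(byRegex["string"]);                 // regex merges both occurrences
Console.WriteLine(byRegex.ContainsKey("substring;")); // regex never produces this key
```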

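Edit

Having looked at your earlier question, I realise my code doesn't quite match your spec. Regardless, it still demonstrates the benefit and cost of using a regex.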


2010-10-28 01:20:12 spender

Thanks for the interesting alternative. – 2010-10-28 01:51:56

Here's a LINQ version without ToDictionary(), which may add unnecessary overhead depending on your needs...

var dic = src.Split(' ').GroupBy(s => s, (str, g) => new { str, count = g.Count() });

Or in query syntax...

var dic = from str in src.Split(' ') 
      group str by str into g 
      select new { str = g.Key, count = g.Count() }; // str is out of scope after "into g", so use g.Key
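One consequence of skipping ToDictionary() is that the query stays lazy: nothing is counted until it is enumerated, and each enumeration redoes the work. The sketch below illustrates this under assumptions: it uses C# 9 top-level statements, and the selectorCalls counter is a hypothetical probe added only to make the evaluation visible.

```csharp
using System;
using System.Linq;

string src = "a b a";      // illustrative input
int selectorCalls = 0;     // hypothetical probe: counts result-selector invocations

// Same GroupBy overload as above, with the probe added to the result selector.
var query = src.Split(' ').GroupBy(s => s, (str, g) =>
{
    selectorCalls++;
    return new { str, count = g.Count() };
});

int beforeEnumeration = selectorCalls;   // the query has not run yet

var list = query.ToList();               // first enumeration: one call per distinct word
int afterList = selectorCalls;

var dict = query.ToDictionary(p => p.str, p => p.count);   // second enumeration repeats the work
int afterDictionary = selectorCalls;

Console.WriteLine($"{beforeEnumeration} {afterList} {afterDictionary}");
```

Whether materialising is overhead or a saving therefore depends on how often the results are read: a single foreach over the lazy query does the grouping once, while repeated reads may be cheaper against a list or dictionary built once.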


2010-10-28 01:33:26 dahlbyk

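+1 for a dead heat with the accepted answer. – 2010-10-28 01:52:55

Getting rid of String.Split doesn't leave many options on the table. One option is Regex.Matches, as spender demonstrated; the other is Regex.Split (which doesn't give us anything new).

Instead of grouping, you could use either of these approaches: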

var target = src.Split(new[] { ' ', ',', ';' }, StringSplitOptions.RemoveEmptyEntries); 
var result = target.Distinct() 
        .Select(s => new { Word = s, Count = target.Count(w => w == s) }); 

// or dictionary approach 
var result = target.Distinct() 
        .ToDictionary(s => s, s => target.Count(w => w == s));

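The call to Distinct is needed to avoid duplicate items. I went ahead and expanded the set of characters to split on, to get the actual words without punctuation. I found the first approach to be the fastest using spender's benchmarking code.

Edit:

Going back to the requirement from your earlier question of ordering the results, you can easily extend the first approach as shown below, since the anonymous type keeps the count close at hand without needing a tuple.

var result = target.Distinct() 
        .Select(s => new { Word = s, Count = target.Count(w => w == s) }) 
        .OrderByDescending(o => o.Count); 

// or in query form 

var result = from s in target.Distinct() 
      let count = target.Count(w => w == s) 
      orderby count descending 
      select new { Word = s, Count = count };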
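As a cross-check, the Distinct-based counting in this answer and the GroupBy-based counting from the other answers should yield the same {word, count} pairs. The sketch below compares them on a short made-up input; it assumes C# 9 top-level statements, and the names viaDistinct, viaGroupBy and same are illustrative.

```csharp
using System;
using System.Linq;

string src = "the string, the rest; the end";   // illustrative input
var target = src.Split(new[] { ' ', ',', ';' }, StringSplitOptions.RemoveEmptyEntries);

// Distinct-based: one scan of target for every distinct word.
var viaDistinct = target.Distinct()
                        .ToDictionary(s => s, s => target.Count(w => w == s));

// GroupBy-based: one pass that buckets the words, then a count per bucket.
var viaGroupBy = target.GroupBy(s => s)
                       .ToDictionary(g => g.Key, g => g.Count());

// Same keys and same counts in both dictionaries.
bool same = viaDistinct.Count == viaGroupBy.Count
            && viaDistinct.All(kv => viaGroupBy[kv.Key] == kv.Value);

Console.WriteLine(same);
```

The two forms differ only in how the work is done, so which one is faster for a given input is best settled by measuring, as the benchmark code earlier in the thread does.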


2010-10-28 01:59:24

Thanks Ahmad, especially for the benchmark info. – 2010-10-28 11:15:13
