2014-10-28 106 views
2

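Split, group and count string

I want to split, group, and count the occurrences of specific phrases in a large string in C#.

The pseudocode below should give some indication of what I am trying to achieve.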

var my_string = "In the end this is not the end"; 
my_string.groupCount(2); 

==> 
    [0] : {Key: "In the", Count:1} 
    [1] : {Key: "the end", Count:2} 
    [2] : {Key: "end this", Count: 1} 
    [3] : {Key: "this is", Count: 1} 
    [4] : {Key: "is not", Count: 1} 
    [5] : {Key: "not the", Count: 1} 

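As you will notice, this is not as simple as splitting the string and counting each substring. The example groups every two words, but ideally it should be able to handle any group size.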

+0

Thanks, I missed that. – tribe84 2014-10-28 20:38:51

+0

@GrantWinney - No, the two questions are similar, but not the same. – tribe84 2014-10-28 20:39:39

+0

How do you want to split the 'input'? – 2014-10-28 20:49:59

Answers

1

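Here is an outline of how you can approach this:

  • Use the regular Split method of string to get the individual words
  • Make a dictionary for the counts
  • Go through all pairs of adjacent words, building a composite key for each pair and incrementing its count

Here is how you can implement this: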

var counts = new Dictionary<string,int>(); 
var tokens = str.Split(' '); 
for (var i = 0 ; i < tokens.Length-1 ; i++) { 
    var key = tokens[i]+" "+tokens[i+1]; 
    int c; 
    if (!counts.TryGetValue(key, out c)) { 
     c = 0; 
    } 
    counts[key] = c + 1; 
} 

Demo.

+0

What happens if the string is huge? – gabba 2014-10-28 21:23:55

+0

@gabba The same thing that happens when the string is not huge :-) The task is linear in both time and memory. – dasblinkenlight 2014-10-28 21:26:38

+0

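When you split a 2 GB string into thousands of small strings, you more than double the memory consumption. We don't need to do that. We only need a single scan and a small dictionary. – gabba 2014-10-28 21:31:46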

0

Here is my implementation. I have updated it to move the work into a function and to let you specify an arbitrary group size.

public static Dictionary<string,int> groupCount(string str, int groupSize) 
{ 
    string[] tokens = str.Split(new char[] { ' ' }); 

    var dict = new Dictionary<string,int>(); 
    for (int i = 0; i < tokens.Length - (groupSize-1); i++) 
    { 
     string key = ""; 
     for (int j = 0; j < groupSize; j++) 
     { 
      key += tokens[i+j] + " "; 
     } 
     key = key.Substring(0, key.Length-1); 

     if (dict.ContainsKey(key)) { 
      dict[key]++; 
     } else { 
      dict[key] = 1; 
     } 
    } 

    return dict; 
} 

Use it like this:

string str = "In the end this is not the end"; 
int groupSize = 2; 
var dict = groupCount(str, groupSize); 

Console.WriteLine("Group Of {0}:", groupSize); 
foreach (string k in dict.Keys) { 
    Console.WriteLine("Key: \"{0}\", Count: {1}", k, dict[k]); 
} 

.NET Fiddle

+0

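I'll note that this is very similar to dasblinkenlight's take. It uses Split to get the individual words, a for loop to build the tokens, and a dictionary to keep the count of each token. – 2014-10-28 20:51:15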

0

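You can create a method that builds phrases from the given words. It is not very efficient (because of the Skip call), but the implementation is simple: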

private static IEnumerable<string> CreatePhrases(string[] words, int wordsCount) 
{ 
    for(int i = 0; i <= words.Length - wordsCount; i++) 
     yield return String.Join(" ", words.Skip(i).Take(wordsCount)); 
} 

The rest is simple: split your string into words, build the phrases, and count the occurrences of each phrase:

var my_string = "In the end this is not the end"; 
var words = my_string.Split(); 
var result = from p in CreatePhrases(words, 2) 
      group p by p into g 
      select new { g.Key, Count = g.Count()}; 

Result:

[ 
    Key: "In the", Count: 1, 
    Key: "the end", Count: 2, 
    Key: "end this", Count: 1, 
    Key: "this is", Count: 1, 
    Key: "is not", Count: 1, 
    Key: "not the", Count: 1 
] 

Creating consecutive groups of items (a more efficient approach that works with any IEnumerable):

public static IEnumerable<IEnumerable<T>> ToConsecutiveGroups<T>(
    this IEnumerable<T> source, int size) 
{ 
    // You can check arguments here    
    Queue<T> bucket = new Queue<T>(); 

    foreach(var item in source) 
    { 
     bucket.Enqueue(item); 
     if (bucket.Count == size) 
     { 
      yield return bucket.ToArray(); 
      bucket.Dequeue(); 
     } 
    } 
} 

And the whole calculation can then be done in one line:

var my_string = "In the end this is not the end"; 
var result = my_string.Split() 
       .ToConsecutiveGroups(2) 
       .Select(words => String.Join(" ", words)) 
       .GroupBy(p => p) 
       .Select(g => new { g.Key, Count = g.Count()}); 
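
One caveat about the "You can check arguments here" comment in ToConsecutiveGroups: because that method is an iterator (it uses yield return), any checks placed inside it only run when the result is first enumerated, not when the method is called. A minimal sketch of the usual workaround follows, assuming a hypothetical wrapper named ToConsecutiveGroupsChecked and a private helper named Iterate (neither name appears in the answer above). The wrapper validates eagerly and then hands off to the iterator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ConsecutiveGroupsExtensions
{
    // Hypothetical wrapper: it is not an iterator itself, so these checks
    // run at call time rather than on first enumeration.
    public static IEnumerable<T[]> ToConsecutiveGroupsChecked<T>(
        this IEnumerable<T> source, int size)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (size < 1) throw new ArgumentOutOfRangeException(nameof(size));
        return Iterate(source, size);
    }

    // Iterator body: the same sliding-window logic as in the answer.
    private static IEnumerable<T[]> Iterate<T>(IEnumerable<T> source, int size)
    {
        var bucket = new Queue<T>(size);
        foreach (var item in source)
        {
            bucket.Enqueue(item);
            if (bucket.Count == size)
            {
                yield return bucket.ToArray(); // snapshot of the current window
                bucket.Dequeue();              // slide the window by one item
            }
        }
    }
}
```

With this split, a call such as my_string.Split().ToConsecutiveGroupsChecked(0) should throw immediately instead of failing later inside the GroupBy pipeline.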
+1

Yeeaahh, the expression in the last line is the best solution here – gabba 2014-10-28 21:29:04

+1

@gabba Yes, the best I can do at 1 AM :) Instead of just returning the count, I was doing a Distinct :) – 2014-10-28 21:40:36

+0

Your solution would be even better if you wrote a SplitToConsecutiveGroups method that iterates through the source string and returns the groups of words – gabba 2014-10-29 09:07:03

1

Here is another approach that uses an ILookup<string, string[]> to count the occurrences of each word group:

var my_string = "In the end this is not the end"; 
int step = 2; 
string[] words = my_string.Split(); 
var groupWords = new List<string[]>(); 
for (int i = 0; i + step <= words.Length; i++) 
{ 
    string[] group = new string[step]; 
    for (int ii = 0; ii < step; ii++) 
     group[ii] = words[i + ii]; 
    groupWords.Add(group); 
} 
var lookup = groupWords.ToLookup(w => string.Join(" ", w)); 

foreach(var kv in lookup) 
    Console.WriteLine("Key: \"{0}\", Count: {1}", kv.Key, kv.Count()); 

Output:

Key: "In the", Count: 1 
Key: "the end", Count: 2 
Key: "end this", Count: 1 
Key: "this is", Count: 1 
Key: "is not", Count: 1 
Key: "not the", Count: 1 
+1

Nice! A lookup works well here – gabba 2014-10-28 21:44:06

0

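Assuming you need to handle huge strings, I would not recommend splitting the whole string. Instead, walk through it once, remember where the last groupCount words start, and count each combination in a dictionary: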

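var my_string = "In the end this is not the end";

var groupCount = 2;

var groups = new Dictionary<string, int>();
// Start indexes of the most recent words (at most groupCount of them).
var wordStarts = new Queue<int>();

// Run one step past the end so the final word is closed as well.
for (int i = 0; i <= my_string.Length; i++)
{
    bool atBoundary = i == my_string.Length || my_string[i] == ' ';
    bool prevIsSpace = i == 0 || my_string[i - 1] == ' ';

    if (!atBoundary && prevIsSpace)
    {
        // A new word starts here.
        wordStarts.Enqueue(i);
    }
    else if (atBoundary && !prevIsSpace && wordStarts.Count == groupCount)
    {
        // A word has just ended and the window holds groupCount words.
        var firstWordInGroupIndex = wordStarts.Dequeue();

        // Runs of several spaces between words are kept as-is in the key.
        var groupKey = my_string.Substring(firstWordInGroupIndex, i - firstWordInGroupIndex);

        int count;
        groups.TryGetValue(groupKey, out count); // count stays 0 for a new key
        groups[groupKey] = count + 1;
    }
}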
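
The same single-scan idea can also be applied when the text is read incrementally instead of being held as one string. The sketch below assumes the input is available through a TextReader (for example a StreamReader over a file). The class name WordGroupCounter and the method name CountGroups are hypothetical and chosen only for illustration. Memory use then depends on the number of distinct groups, not on the size of the input.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class WordGroupCounter
{
    // Hypothetical helper: counts every run of groupSize consecutive words.
    public static Dictionary<string, int> CountGroups(TextReader reader, int groupSize)
    {
        if (reader == null) throw new ArgumentNullException(nameof(reader));
        if (groupSize < 1) throw new ArgumentOutOfRangeException(nameof(groupSize));

        var counts = new Dictionary<string, int>();
        var window = new Queue<string>(groupSize); // the most recent words
        var word = new StringBuilder();            // the word being read

        while (true)
        {
            int ch = reader.Read();                // -1 signals end of input
            bool isSeparator = ch == -1 || char.IsWhiteSpace((char)ch);

            if (!isSeparator)
            {
                word.Append((char)ch);
            }
            else if (word.Length > 0)              // a word has just ended
            {
                window.Enqueue(word.ToString());
                word.Clear();

                if (window.Count == groupSize)
                {
                    string key = string.Join(" ", window);
                    int count;
                    counts.TryGetValue(key, out count); // stays 0 for a new key
                    counts[key] = count + 1;
                    window.Dequeue();              // slide the window by one word
                }
            }

            if (ch == -1) break;
        }

        return counts;
    }
}
```

For the sample sentence, WordGroupCounter.CountGroups(new StringReader("In the end this is not the end"), 2) yields six keys, with "the end" counted twice. Unlike the Substring-based version, this one treats any whitespace as a separator and joins words with a single space.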