Split, group and count strings

I want to split, group, and count the occurrences of specific phrases in a large string in C#.

The following pseudocode should give some indication of what I am trying to achieve.

var my_string = "In the end this is not the end"; 
my_string.groupCount(2); 

==> 
    [0] : {Key: "In the", Count:1} 
    [1] : {Key: "the end", Count:2} 
    [2] : {Key: "end this", Count: 1} 
    [3] : {Key: "this is", Count: 1} 
    [4] : {Key: "is not", Count: 1} 
    [5] : {Key: "not the", Count: 1}

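As you will notice, this is not as simple as splitting the string and counting each substring. The example groups every two words, but ideally it should be able to handle any group size.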

Source

2014-10-28 tribe84

Thanks, I missed that. – tribe84 2014-10-28 20:38:51

@GrantWinney - No, the two questions are similar, but not the same. – tribe84 2014-10-28 20:39:39

How do you want to split the 'input'? – 2014-10-28 20:49:59

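Here is an outline of how you can approach this:

Use the regular Split method of string to obtain the individual words
Make a dictionary for the counts
Go through all pairs of adjacent words, building a composite key for each pair and incrementing its count

Here is how you can implement this: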

var counts = new Dictionary<string,int>(); 
var tokens = str.Split(' '); 
for (var i = 0 ; i < tokens.Length-1 ; i++) { 
    var key = tokens[i]+" "+tokens[i+1]; 
    int c; 
    if (!counts.TryGetValue(key, out c)) { 
     c = 0; 
    } 
    counts[key] = c + 1; 
}

Source

2014-10-28 20:47:38 dasblinkenlight

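What happens if the string is very large? – gabba 2014-10-28 21:23:55

@gabba The same thing that happens with a string that is not very large :-) The task is linear in both time and memory. – dasblinkenlight 2014-10-28 21:26:38

When you split a 2GB string into thousands of small strings, you more than double the memory consumption. We don't need to do that. We only need a single scan and a small dictionary. – gabba 2014-10-28 21:31:46

Here is my implementation. I have updated it to move the work into a function and to let you specify an arbitrary group size.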

public static Dictionary<string,int> groupCount(string str, int groupSize) 
{ 
    string[] tokens = str.Split(new char[] { ' ' }); 

    var dict = new Dictionary<string,int>(); 
    for (int i = 0; i < tokens.Length - (groupSize-1); i++) 
    { 
     string key = ""; 
     for (int j = 0; j < groupSize; j++) 
     { 
      key += tokens[i+j] + " "; 
     } 
     key = key.Substring(0, key.Length-1); 

     if (dict.ContainsKey(key)) { 
      dict[key]++; 
     } else { 
      dict[key] = 1; 
     } 
    } 

    return dict; 
}

Usage is as follows:

string str = "In the end this is not the end"; 
int groupSize = 2; 
var dict = groupCount(str, groupSize); 

Console.WriteLine("Group Of {0}:", groupSize); 
foreach (string k in dict.Keys) { 
    Console.WriteLine("Key: \"{0}\", Count: {1}", k, dict[k]); 
}

Source

2014-10-28 20:50:05

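I will note that it is very similar to dasblinkenlight's take. It uses Split to get the individual words, a for loop to walk the tokens, and a dictionary to keep the count of each token group. – 2014-10-28 20:51:15

You can create a method that builds phrases from the given words. It is not very efficient (because of the Skip calls), but it is a simple implementation: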

private static IEnumerable<string> CreatePhrases(string[] words, int wordsCount) 
{ 
    for(int i = 0; i <= words.Length - wordsCount; i++) 
     yield return String.Join(" ", words.Skip(i).Take(wordsCount)); 
}

The rest is simple: split your string into words, build the phrases, and count the occurrences of each phrase in the original string:

var my_string = "In the end this is not the end"; 
var words = my_string.Split(); 
var result = from p in CreatePhrases(words, 2) 
      group p by p into g 
      select new { g.Key, Count = g.Count()};

Result:

[ 
    Key: "In the", Count: 1, 
    Key: "the end", Count: 2, 
    Key: "end this", Count: 1, 
    Key: "this is", Count: 1, 
    Key: "is not", Count: 1, 
    Key: "not the", Count: 1 
]
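If the repeated Skip calls are a concern, one possible alternative for arrays is the String.Join overload that takes a start index and a count, which reads the slice directly. The sketch below is only an illustration: the class name PhraseSketch and the method name CreatePhrasesByIndex are made up for this example, and how much it actually saves depends on the runtime, since some LINQ implementations already optimize Skip for arrays.

```csharp
using System;
using System.Collections.Generic;

static class PhraseSketch
{
    // Hypothetical variant of CreatePhrases (name chosen for illustration only).
    // Assumes words is non-null and wordsCount >= 1.
    public static IEnumerable<string> CreatePhrasesByIndex(string[] words, int wordsCount)
    {
        for (int i = 0; i <= words.Length - wordsCount; i++)
        {
            // Join(separator, value, startIndex, count) joins words[i .. i + wordsCount - 1]
            // without enumerating the elements before index i.
            yield return string.Join(" ", words, i, wordsCount);
        }
    }
}
```

It can be dropped into the query in place of CreatePhrases(words, 2); the grouping step stays the same.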

Creating consecutive groups of items (a more efficient approach that works with any IEnumerable):

public static IEnumerable<IEnumerable<T>> ToConsecutiveGroups<T>(
    this IEnumerable<T> source, int size) 
{ 
    // You can check arguments here    
    Queue<T> bucket = new Queue<T>(); 

    foreach(var item in source) 
    { 
     bucket.Enqueue(item); 
     if (bucket.Count == size) 
     { 
      yield return bucket.ToArray(); 
      bucket.Dequeue(); 
     } 
    } 
}

And the whole computation can then be done in a single statement:

var my_string = "In the end this is not the end"; 
var result = my_string.Split() 
       .ToConsecutiveGroups(2) 
       .Select(words => String.Join(" ", words)) 
       .GroupBy(p => p) 
       .Select(g => new { g.Key, Count = g.Count()});

Source

2014-10-28 20:58:28

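Yeeaahh, the expression on the last line is the best solution here – gabba 2014-10-28 21:29:04

@gabba Yes, the best I can do at 1 a.m. :) Instead of just returning the count, I did it differently :) – 2014-10-28 21:40:36

Your solution would be even better if you wrote a SplitToConsecutiveGroups method that iterates through the source string and returns the groups of words – gabba 2014-10-29 09:07:03

Here is another approach that uses an ILookup<string, string[]> to count the occurrences of each word array: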

var my_string = "In the end this is not the end"; 
int step = 2; 
string[] words = my_string.Split(); 
var groupWords = new List<string[]>(); 
for (int i = 0; i + step <= words.Length; i++) 
{ 
    string[] group = new string[step]; 
    for (int ii = 0; ii < step; ii++) 
     group[ii] = words[i + ii]; 
    groupWords.Add(group); 
} 
var lookup = groupWords.ToLookup(w => string.Join(" ", w)); 

foreach(var kv in lookup) 
    Console.WriteLine("Key: \"{0}\", Count: {1}", kv.Key, kv.Count());

Output:

Key: "In the", Count: 1 
Key: "the end", Count: 2 
Key: "end this", Count: 1 
Key: "this is", Count: 1 
Key: "is not", Count: 1 
Key: "not the", Count: 1

Source

2014-10-28 21:11:12

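Nice! A lookup works well here – gabba 2014-10-28 21:44:06

@dasblinkenlight Assuming you need to handle large strings, I would not recommend splitting the whole string. You need to go through it once, remember where the last groupCount words start, and count the combinations in a dictionary: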

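var my_string = "In the end this is not the end"; 

var groupCount = 2; 

var groups = new Dictionary<string, int>(); 
// Start indexes of the most recent words; holds at most groupCount entries. 
// Assumes words are separated by single spaces, with no leading or trailing space. 
var lastGroupCountWordIndexes = new Queue<int>(); 
lastGroupCountWordIndexes.Enqueue(0); 

// Single scan; i == my_string.Length is treated as the final word boundary. 
for (int i = 1; i <= my_string.Length; i++) 
{ 
    if (i < my_string.Length && my_string[i] != ' ') 
    { 
        continue; 
    } 

    // A word has just ended at i. Once groupCount words are buffered, count the group. 
    if (lastGroupCountWordIndexes.Count == groupCount) 
    { 
        var firstWordInGroupIndex = lastGroupCountWordIndexes.Dequeue(); 
        var groupKey = my_string.Substring(firstWordInGroupIndex, i - firstWordInGroupIndex); 

        if (!groups.ContainsKey(groupKey)) 
        { 
            groups.Add(groupKey, 1); 
        } 
        else 
        { 
            groups[groupKey]++; 
        } 
    } 

    // The next word starts right after the space. 
    lastGroupCountWordIndexes.Enqueue(i + 1); 
}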
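The single-scan idea in this answer can also be applied to input that is read incrementally, so that the full text never has to sit in memory as one string. The following is a minimal sketch under stated assumptions, not part of the original answer: the class name StreamingGroupCounter and its Count method are hypothetical, words are taken to be runs of non-whitespace characters, and the code uses C# 7 syntax (out variables).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class StreamingGroupCounter
{
    // Hypothetical helper: counts every run of groupSize consecutive words
    // read from the given reader, keeping only the last groupSize words in memory.
    public static Dictionary<string, int> Count(TextReader reader, int groupSize)
    {
        if (groupSize < 1) throw new ArgumentOutOfRangeException(nameof(groupSize));

        var counts = new Dictionary<string, int>();
        var window = new Queue<string>();   // the most recent words
        var word = new StringBuilder();     // the word currently being read

        while (true)
        {
            int ch = reader.Read();
            bool atEnd = ch == -1;

            if (!atEnd && !char.IsWhiteSpace((char)ch))
            {
                word.Append((char)ch);
                continue;
            }

            // Whitespace or end of input: finish the current word, if any.
            if (word.Length > 0)
            {
                window.Enqueue(word.ToString());
                word.Clear();

                if (window.Count == groupSize)
                {
                    string key = string.Join(" ", window);
                    counts.TryGetValue(key, out int c); // c is 0 when the key is new
                    counts[key] = c + 1;
                    window.Dequeue();                   // slide the window by one word
                }
            }

            if (atEnd) break;
        }

        return counts;
    }
}
```

For the sample sentence, StreamingGroupCounter.Count(new StringReader("In the end this is not the end"), 2) yields the same six keys as the other answers. Any other TextReader, such as one returned by File.OpenText, could be passed in the same way. The dictionary still grows with the number of distinct groups, so memory use is bounded by the vocabulary rather than by the input length.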

Source

2014-10-28 21:20:24 gabba
