How to find the 10 most common words in a text

-3

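I have arbitrary text in a txt file, and I need to find the 10 most common words in it. How should I do this? I think I should read the text sentence by sentence, from period to period, and put it into an array, but I don't know how.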

2016-11-21 Vitalius Kunigiskis

What have you tried so far? –

Split the text into words, group them by word, order the groups by count (descending), and take the first 10. –

You can do this with LINQ. Try something like this:

var words = "two one three one three one";
var orderedWords = words
    .Split(' ')                         // break the text into words
    .GroupBy(x => x)                    // one group per distinct word
    .Select(x => new {
        KeyField = x.Key,
        Count = x.Count() })
    .OrderByDescending(x => x.Count)    // most frequent first
    .Take(10);


2016-11-21 06:34:58 JanneP

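'ToList()' is *redundant*: '... words.Split(' ').GroupBy(x => x) ...' –

Very true, Dmitry, it isn't needed. I edited the code sample. – JanneP

If you are given "*arbitrary text* in a txt file", your current routine will run into difficulties: you have to remove all the *punctuation* (*commas*, *full stops*, etc.), and you have to deal with *case*, e.g. "One is just one, but not one inc." - the word 'one' appears twice –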
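A minimal sketch of the two clean-up steps that comment asks for, written as a top-level program (this assumes C# 9 or later). `CountWords` is a hypothetical helper name, not something from the answer above. Note that `char.IsPunctuation` also treats the apostrophe as punctuation, so a word like "it's" would be split in two.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: strips punctuation and folds case before counting.
static Dictionary<string, int> CountWords(string text)
{
    // Turn every punctuation character into a space, so "one," becomes "one ".
    var cleaned = new string(text.Select(c => char.IsPunctuation(c) ? ' ' : c).ToArray());

    return cleaned
        .Split(' ', '\t', '\r', '\n')
        .Where(w => w.Length > 0)             // drop empty strings left between separators
        .Select(w => w.ToLowerInvariant())    // "One" and "one" become the same key
        .GroupBy(w => w)
        .ToDictionary(g => g.Key, g => g.Count());
}

var counts = CountWords("One is just one, but not one inc.");
Console.WriteLine(counts["one"]);   // 3, once case is folded
```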

Read all the data into a string and split it into an array.

For example:

char[] delimiterChars = { ' ', ',', '.', ':', '\t' };
string text = "one\ttwo three:four,five six seven";

string[] words = text.Split(delimiterChars);

// Count how many times each word occurs
var dict = new Dictionary<string, int>();
foreach (var value in words)
{
    if (dict.ContainsKey(value))
        dict[value]++;
    else
        dict[value] = 1;
}

// Print the 10 most frequent words (needs using System.Linq)
foreach (var pair in dict.OrderByDescending(p => p.Value).Take(10))
{
    Console.WriteLine(pair.Key + ": " + pair.Value);
}

The entries have to be sorted by count, largest first, before you print them.


2016-11-21 06:49:15

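Counter example: 'text = "one, two, three, four, four, five";' The expected result is "four" at the top. The actual result is that the *empty string* beats them all. –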
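One way to handle that counter example is to pass `StringSplitOptions.RemoveEmptyEntries` to `Split`, so that adjacent delimiters such as ", " no longer produce empty strings. The sketch below is one possible fix, not the answerer's own code; it reuses the delimiter list from the answer and is written as a top-level program (assuming C# 9 or later).

```csharp
using System;
using System.Linq;

char[] delimiterChars = { ' ', ',', '.', ':', '\t' };
string text = "one, two, three, four, four, five";

// RemoveEmptyEntries discards the empty strings produced between "," and " ".
string[] words = text.Split(delimiterChars, StringSplitOptions.RemoveEmptyEntries);

// Pick the most frequent word.
var top = words
    .GroupBy(w => w)
    .OrderByDescending(g => g.Count())
    .First();

Console.WriteLine(top.Key + ": " + top.Count());   // four: 2
```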

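The most difficult part of the task is splitting the initial text into words. In a natural language (such as English), a word is quite a complicated thing: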

Forget-me-not      // 1 word (a nice blue flower)
Do not Forget me!  // 4 words
Cannot             // 1 word or shall we split "cannot" into "can" + "not"?
May not            // 2 words
George W. Bush     // Is "W" a word?
W.A.S.P.           // ...If it is, is it equal to "W" in the "W.A.S.P"?
Donald Trump       // Homonyms: name
Spades is a trump  // ...and a special follow in a game of cards
It's an IT; it is  // "It" and "IT" are different (IT is an acronym), "It" and "it" are same

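Another issue is this: you probably want to count It and it as one and the same word, but IT as a different one, an acronym. As a first attempt, I suggest something like this: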

// Needs: System.IO, System.Linq, System.Globalization, System.Text.RegularExpressions
var top10words = File
    .ReadLines(@"C:\MyFile.txt")
    .SelectMany(line => Regex
        .Matches(line, @"[A-Za-z'-]+")   // letters, apostrophe and dash make up a word
        .OfType<Match>()
        .Select(match => CultureInfo.InvariantCulture.TextInfo.ToTitleCase(match.Value)))
    .GroupBy(word => word)
    .Select(chunk => new {
        word = chunk.Key,
        count = chunk.Count() })
    .OrderByDescending(item => item.count)
    .ThenBy(item => item.word)           // ties are broken alphabetically
    .Take(10);

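In my solution I have assumed that:

A word can contain only the letters A..Z and a..z, plus - (dash) and ' (apostrophe)
TitleCase is used to separate all-caps acronyms from ordinary words (It and it are treated as the same word, while IT is a different one)
In case of a tie (two or more words with the same frequency), the tie is broken alphabetically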
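To see how these assumptions play out, here is a small sketch that runs the same pipeline over an in-memory string instead of a file. The sample text and the name `ranked` are made up for illustration, and the top-level program form assumes C# 9 or later. It relies on `TextInfo.ToTitleCase` leaving all-uppercase words unchanged, which is how that method treats acronyms.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample text, used here instead of a file.
string sample = "it IT It it apple Zebra zebra";

var ranked = Regex.Matches(sample, @"[A-Za-z'-]+")
    .Cast<Match>()
    // "it" and "It" both become "It"; the all-caps "IT" stays as it is.
    .Select(m => CultureInfo.InvariantCulture.TextInfo.ToTitleCase(m.Value))
    .GroupBy(w => w)
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(x => x.Count)
    .ThenBy(x => x.Word)                 // alphabetical order among equal counts
    .ToList();

foreach (var x in ranked)
    Console.WriteLine(x.Word + ": " + x.Count);
// It: 3
// Zebra: 2
// Apple: 1
// IT: 1
```

"Apple" and "IT" both occur once, so the alphabetical tie-break puts "Apple" first.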


2016-11-21 06:54:02
