Fastest option for splitting text into words in .NET?

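My task is to implement a simple semantic analysis of a text (an 800 MB txt file). For small files everything is fast. I read the file line by line, and that part works: reading the whole file takes 9 s. But as soon as I start analyzing the lines, adding the words to a dictionary and storing their positions in the text, processing takes far too long.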

Can you suggest a better variant or a better solution? I would appreciate any advice on text processing and semantic analysis. Thank you.

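public List<string> SplitWords(string s)
{
    // Lower-case the line, then split on runs of non-word characters.
    s = s.ToLower();
    arrayWords = Regex.Split(s, @"\W+");
    listWords = arrayWords.OfType<string>().ToList();

    // Drop stop words and words shorter than 2 characters.
    // Array.BinarySearch requires stopwords to be sorted.
    for (int i = 0; i < listWords.Count; i++)
    {
        if (Array.BinarySearch(stopwords, listWords[i]) >= 0 || listWords[i].Length < 2)
        {
            listWords.RemoveAt(i);
            i--;
        }
    }
    return listWords;
}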

Code for splitting the words.

public void AddToDictonary(List<string> arrayWords)
{
    for (int i = 0; i < arrayWords.Count; i++)
    {
        if (!dictonary.ContainsKey(arrayWords[i]))
        {
            dictonary.Add(arrayWords[i], new List<int>() { i });
        }
        else
        {
            dictonary[arrayWords[i]].Add(i);
        }
    }
}

Code for adding to the dictionary.

2013-02-14 user2039847

Instead of checking with ContainsKey first, you should use the 'TryGetValue' method. See: http://stackoverflow.com/questions/9382681/what-is-more-efficient-dictionary-trygetvalue-or-containskeyitem – 2013-02-14 00:48:26
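A minimal sketch of what the TryGetValue suggestion could look like for the question's dictionary code. The class name `WordIndex` and the choice to pass the dictionary as a parameter are assumptions made so the example stands alone; the `out` variable declaration needs C# 7 or later.

```csharp
using System.Collections.Generic;

// Hypothetical helper class; not part of the original question.
public static class WordIndex
{
    // Records the index i of every word. An existing word costs one hash
    // lookup (TryGetValue) instead of two (ContainsKey plus the indexer).
    public static void AddToDictionary(
        Dictionary<string, List<int>> dictionary,
        IReadOnlyList<string> words)
    {
        for (int i = 0; i < words.Count; i++)
        {
            if (!dictionary.TryGetValue(words[i], out List<int> positions))
            {
                // First occurrence: create the position list and register it.
                positions = new List<int>();
                dictionary.Add(words[i], positions);
            }
            positions.Add(i);
        }
    }
}
```

As in the original code, the stored index is the word's position within the list passed in, not its position in the whole file.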

I'd also suggest using **dotTrace** or a similar tool. It gives you a performance report for your code, so you can see which part of it is slow. – 2013-02-14 00:50:59
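If no profiler is at hand, a rough first measurement is possible with `System.Diagnostics.Stopwatch`. This is only a sketch and not a substitute for dotTrace; the class `Timing` and its method `Measure` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper; a coarse stand-in for a real profiler.
public static class Timing
{
    // Runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Wrapping the split step and the dictionary step separately, for example `Timing.Measure(() => SplitWords(line))`, would show which of the two dominates before anything is optimized.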

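I tried TryGetValue. Thanks. The slowest code is the FOR loop (in the word-splitting function), where I compare each word with an array of 321 stop words from my text file. I'm thinking about using StringBuilder. What do you think? How would the speed compare? – user2039847 2013-02-14 12:52:16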
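One possible way to speed up that loop, sketched under assumptions: the stop words are held in a `HashSet<string>` instead of a sorted array, and kept words are appended to a new list rather than removed from the existing one. `List<T>.RemoveAt` shifts every later element on each call, which is likely a large part of the cost. StringBuilder mainly helps when strings are concatenated, which this loop does not do, so it probably would not change much here. The class name `Tokenizer` and the parameter layout are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; not part of the original question.
public static class Tokenizer
{
    // Built once and reused for every line instead of re-parsing the pattern.
    private static readonly Regex NonWord = new Regex(@"\W+", RegexOptions.Compiled);

    // Returns the lower-cased words of s that have at least 2 characters
    // and are not in the stop word set.
    public static List<string> SplitWords(string s, HashSet<string> stopwords)
    {
        var result = new List<string>();
        foreach (string word in NonWord.Split(s.ToLower()))
        {
            // HashSet.Contains is an average O(1) ordinal lookup.
            if (word.Length >= 2 && !stopwords.Contains(word))
            {
                result.Add(word);
            }
        }
        return result;
    }
}
```

The set would be built once, for example with `new HashSet<string>(stopwords)` from the existing array, and passed in for every line.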

You can use the regular expression I posted here to tokenize your sentence, provided the dictionary contains the words.

2013-02-20 16:44:34
