2013-03-01 182 views
1

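I want to compute term frequency and inverse document frequency (TF-IDF) for the text files in a particular collection of documents. How do I remove and count words in a text file?

In this case, I just want to count the total number of words in a file and the number of occurrences of a particular word in that file, and remove words like a, an, the, etc.

Is there a parser for this in VB.NET?
Thanks in advance.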

+0

Go through this [tutorial](http://www.codeproject.com/Questions/302262/How-to-search-specific-string-into-分離文本文件) and let me know if it helps. – 2013-03-01 05:40:03

Answers

1

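The simplest way I know is this:

Private Function CountWords(Filename As String) As Integer 
    ' Split on single spaces and ignore the empty entries produced by repeated spaces 
    Return IO.File.ReadAllText(Filename).Split({" "c}, StringSplitOptions.RemoveEmptyEntries).Length 
End Function 

If you want to remove words, you can do this:

Private Sub RemoveWords(Filename As String, DeleteWords As List(Of String)) 
    Dim AllWords() As String = IO.File.ReadAllText(Filename).Split(" "c) 
    ' Keep only the words that are not in DeleteWords; ToArray() materializes the query 
    Dim RemainingWords() As String = (From Word As String In AllWords 
                                      Where DeleteWords.IndexOf(Word) = -1).ToArray() 

    'Do something with RemainingWords, e.g.: 
    'IO.File.WriteAllText(Filename, String.Join(vbNewLine, RemainingWords)) 
End Sub  

This assumes that words are separated by spaces.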
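
Real text files usually also contain line breaks, tabs, and punctuation, so splitting on spaces alone tends to leave tokens such as "sat." or "dog,". A minimal sketch of a more tolerant split follows; the helper name `SplitIntoWords` and the separator list are assumptions for illustration, not part of the answer above.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module WordSplitSketch
    ' Hypothetical helper: splits text on whitespace and a few common punctuation marks.
    ' The separator list is an assumption; extend it to suit your documents.
    Function SplitIntoWords(text As String) As String()
        Dim separators As Char() = {" "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf,
                                    "."c, ","c, ";"c, ":"c, "!"c, "?"c}
        ' RemoveEmptyEntries drops the empty strings produced by adjacent separators
        Return text.Split(separators, StringSplitOptions.RemoveEmptyEntries)
    End Function

    Sub Main()
        Dim words As String() = SplitIntoWords("The cat sat." & vbCrLf & "A dog, too!")
        Console.WriteLine(words.Length) ' prints 6
    End Sub
End Module
```

The same array can then be passed through the stop-word filter shown in `RemoveWords`.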

0

Maybe regular expressions will help you:

Imports System.IO 
Imports System.Text.RegularExpressions 

... 

Dim anyWordPattern As String = "\b\w+\b" 
Dim myWordPattern As String = "\bMyWord\b" 
Dim replacePattern As String = "\b(?<sw>a|an|the)\b" 
Dim content As String = File.ReadAllText(<file name>) 
' Count every word in the file 
Dim coll = Regex.Matches(content, anyWordPattern) 
Console.WriteLine("Total words: {0}", coll.Count) 
' Count occurrences of one specific word, ignoring case 
coll = Regex.Matches(content, myWordPattern, RegexOptions.Multiline Or RegexOptions.IgnoreCase) 
Console.WriteLine("My word occurrences: {0}", coll.Count) 
' Remove the stop words a, an, the 
Dim replacedContent = Regex.Replace(content, replacePattern, String.Empty, RegexOptions.Multiline Or RegexOptions.IgnoreCase) 
Console.WriteLine("Replaced content: {0}", replacedContent) 

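Explanation of the regular expressions used:

  • \b - word boundary;
  • \w - any word character;
  • + - quantifier, one or more;
  • (?<sw>...) - named group called "sw" (stop word);
  • a|an|the - alternation: "a" or "an" or "the".
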
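
The patterns above can also be combined to build a per-word frequency table in one pass, which is closer to what a TF-IDF calculation needs. This is only a sketch: the module and function names and the sample sentence are made up for illustration, and the stop-word set is the a/an/the list from the question.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module WordFrequencySketch
    ' Hypothetical helper: counts how often each non-stop word occurs, ignoring case.
    Function CountWords(content As String, stopWords As HashSet(Of String)) As Dictionary(Of String, Integer)
        Dim counts As New Dictionary(Of String, Integer)(StringComparer.OrdinalIgnoreCase)
        For Each m As Match In Regex.Matches(content, "\b\w+\b")
            ' Skip stop words instead of deleting them from the text
            If stopWords.Contains(m.Value) Then Continue For
            Dim current As Integer = 0
            counts.TryGetValue(m.Value, current)
            counts(m.Value) = current + 1
        Next
        Return counts
    End Function

    Sub Main()
        Dim stopWords As New HashSet(Of String)(New String() {"a", "an", "the"}, StringComparer.OrdinalIgnoreCase)
        Dim counts = CountWords("The cat and the dog saw a cat", stopWords)
        Console.WriteLine(counts("cat")) ' prints 2
    End Sub
End Module
```

Skipping stop words while counting avoids rewriting the file content, which may be preferable when the original text must stay intact.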
1

The simplest way to do this is to read the text file into a string and then use the .NET Framework to find a match:

Dim text As String = File.ReadAllText("D:\Temp\MyFile.txt") 
Dim index As Integer = text.IndexOf("hello") 
If index >= 0 Then 
    ' String is in file, starting at character "index" 
End If 
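
`IndexOf` reports only the first match and is case-sensitive by default. If you need a count, one option is to call it in a loop, as in the sketch below; `CountOccurrences` is a hypothetical helper. Note that it matches substrings rather than whole words, so "hello" is also found inside "Othello".

```vbnet
Imports System

Module IndexOfCountSketch
    ' Hypothetical helper: counts non-overlapping, case-insensitive occurrences of a substring.
    Function CountOccurrences(text As String, search As String) As Integer
        If String.IsNullOrEmpty(search) Then Return 0
        Dim count As Integer = 0
        Dim index As Integer = text.IndexOf(search, StringComparison.OrdinalIgnoreCase)
        While index >= 0
            count += 1
            ' Continue searching just past the match that was found
            index = text.IndexOf(search, index + search.Length, StringComparison.OrdinalIgnoreCase)
        End While
        Return count
    End Function

    Sub Main()
        Console.WriteLine(CountOccurrences("Hello world, hello Othello", "hello")) ' prints 3
    End Sub
End Module
```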

Or, as a second solution, you can use a StreamReader and a Regex for this (C#):

//read file content in StreamReader 
StreamReader txtReader = new StreamReader(fName); 
string szReadAll = txtReader.ReadToEnd();//Reads the whole text file to the end 
txtReader.Close(); //Closes the text file after it is fully read. 
txtReader = null; 
//search word in file content 
if (Regex.IsMatch(szReadAll, "SearchME", RegexOptions.IgnoreCase))//If the match is found in szReadAll 
    MessageBox.Show("found"); 
else 
    MessageBox.Show("not found"); 

That's all; I hope this resolves your question. Regards

0

You could try something like this:

Dim text As String = IO.File.ReadAllText("C:\file.txt") 
Dim wordsToSearch() As String = New String() {"Hello", "World", "foo"} 
Dim words As New List(Of String)() 
Dim findings As Dictionary(Of String, List(Of Integer)) 

'Dividing into words' 
words.AddRange(text.Split(New String() {" ", Environment.NewLine()}, StringSplitOptions.RemoveEmptyEntries)) 
'Discarding all the words you don't want' 
words.RemoveAll(New Predicate(Of String)(AddressOf WordsDeleter)) 

findings = SearchWords(words, wordsToSearch) 

Console.WriteLine("Number of 'foo': " & findings("foo").Count) 

And the functions used:

Private Function WordsDeleter(ByVal obj As String) As Boolean 
    ' Stop words to discard (a, an, the, as asked in the question) 
    Dim wordsToDelete As New List(Of String)(New String() {"a", "an", "the"}) 
    Return wordsToDelete.Contains(obj.ToLower) 
End Function 

Private Function SearchWords(ByVal allWords As List(Of String), ByVal wordsToSearch() As String) As Dictionary(Of String, List(Of Integer)) 
    Dim dResult As New Dictionary(Of String, List(Of Integer))() 

    For Each s As String In wordsToSearch 
        Dim positions As New List(Of Integer) 
        dResult.Add(s, positions) 

        ' Start from the beginning for every word and stop once IndexOf returns -1 
        Dim i As Integer = allWords.IndexOf(s) 
        While i >= 0 
            positions.Add(i) 
            i = allWords.IndexOf(s, i + 1) 
        End While 
    Next 

    Return dResult 
End Function 
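
To tie the word counts back to the TF-IDF mentioned in the question, here is a minimal sketch. It assumes each document has already been split into words with stop words removed, and it uses one common convention (TF as count divided by document length, IDF as the natural log of total documents over documents containing the term); other variants exist. All names and the sample data are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic

Module TfIdfSketch
    ' Counts case-insensitive occurrences of a term in a list of words.
    Function CountTerm(term As String, words As List(Of String)) As Integer
        Dim hits As Integer = 0
        For Each w As String In words
            If String.Equals(w, term, StringComparison.OrdinalIgnoreCase) Then hits += 1
        Next
        Return hits
    End Function

    ' TF: occurrences of the term divided by the total number of words in the document.
    Function TermFrequency(term As String, docWords As List(Of String)) As Double
        If docWords.Count = 0 Then Return 0.0
        Return CountTerm(term, docWords) / docWords.Count
    End Function

    ' IDF: natural log of (number of documents / number of documents containing the term).
    ' Returns 0 when no document contains the term (an assumption made to avoid dividing by zero).
    Function InverseDocumentFrequency(term As String, docs As List(Of List(Of String))) As Double
        Dim containing As Integer = 0
        For Each doc As List(Of String) In docs
            If CountTerm(term, doc) > 0 Then containing += 1
        Next
        If containing = 0 Then Return 0.0
        Return Math.Log(docs.Count / containing)
    End Function

    Sub Main()
        ' Made-up sample data: two tiny documents, already tokenized
        Dim docs As New List(Of List(Of String)) From {
            New List(Of String) From {"cat", "sat", "mat"},
            New List(Of String) From {"dog", "sat", "log"}
        }
        Dim tfIdf As Double = TermFrequency("cat", docs(0)) * InverseDocumentFrequency("cat", docs)
        Console.WriteLine(tfIdf)
    End Sub
End Module
```

With this sample data, "cat" has a TF of 1/3 in the first document and an IDF of ln(2), while "sat" appears in both documents and so gets an IDF of 0.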