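I want to compute the term frequency and inverse document frequency (TF-IDF) of words in a given collection of text files. How do I remove and count words in a text file?
In this case, I just want to count the total number of words in a file, and in particular the number of occurrences of a given word in the file, and remove words such as a, an, the, etc.
Is there any parser for this in VB.NET?
Thanks in advance.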
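The simplest way I know is this:
Private Function CountWords(Filename As String) As Integer
    ' Split on spaces; empty entries from repeated spaces are not counted as words
    Return IO.File.ReadAllText(Filename).Split({" "c}, StringSplitOptions.RemoveEmptyEntries).Length
End Function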
If you want to remove words, you can do this:
Private Sub RemoveWords(Filename As String, DeleteWords As List(Of String))
    Dim AllWords() As String = IO.File.ReadAllText(Filename).Split(" "c)
    ' Keep only the words that are not in DeleteWords (case-sensitive comparison)
    Dim RemainingWords() As String = (From Word As String In AllWords
                                      Where DeleteWords.IndexOf(Word) = -1
                                      Select Word).ToArray()
    'Do something with RemainingWords, e.g.:
    'IO.File.WriteAllText(Filename, String.Join(vbNewLine, RemainingWords))
End Sub
This assumes the words are separated by spaces.
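If the files also contain tabs, line breaks or punctuation, splitting on a single space will glue some words together. The following is only a sketch of one way to relax that assumption; the module name, the function names and the separator list are made up for illustration and should be adjusted to the actual files.

```vbnet
Imports System
Imports System.IO
Imports Microsoft.VisualBasic

Module WordSplitSketch
    ' Hypothetical separator set (assumption): whitespace plus common punctuation.
    Private ReadOnly Separators As Char() = {
        " "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf,
        "."c, ","c, ";"c, ":"c, "!"c, "?"c}

    ' Splits text into words and drops the empty entries that
    ' consecutive separators would otherwise produce.
    Public Function SplitWords(text As String) As String()
        Return text.Split(Separators, StringSplitOptions.RemoveEmptyEntries)
    End Function

    ' Counts the words in a file using the separator set above.
    Public Function CountWordsInFile(fileName As String) As Integer
        Return SplitWords(File.ReadAllText(fileName)).Length
    End Function
End Module
```

For example, SplitWords("a  b," & vbCrLf & "c") yields the three words a, b and c, whereas a plain Split(" ") would also return an empty entry and keep the comma and line break attached.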
Perhaps regular expressions will help you:
Imports System.IO
Imports System.Text.RegularExpressions
...
Dim anyWordPattern As String = "\b\w+\b"
Dim myWordPattern As String = "\bMyWord\b"
Dim replacePattern As String = "\b(?<sw>a|an|the)\b"
Dim content As String = File.ReadAllText(<file name>)
Dim coll = Regex.Matches(content, anyWordPattern)
Console.WriteLine("Total words: {0}", coll.Count)
coll = Regex.Matches(content, myWordPattern, RegexOptions.Multiline Or RegexOptions.IgnoreCase)
Console.WriteLine("My word occurrences: {0}", coll.Count)
Dim replacedContent = Regex.Replace(content, replacePattern, String.Empty, RegexOptions.Multiline Or RegexOptions.IgnoreCase)
Console.WriteLine("Replaced content: {0}", replacedContent)
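Explanation of the regular expressions used:
\b\w+\b matches any whole word (\b is a word boundary, \w+ is one or more word characters).
\bMyWord\b matches MyWord only as a whole word, not as part of a longer word.
\b(?<sw>a|an|the)\b matches the words a, an or the as whole words and captures the match in a group named sw.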
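Since the question is ultimately about term frequencies, the same word pattern can also feed a per-word count instead of a single total. This is a rough sketch rather than a tested implementation: the module name, CountTerms, and the stop-word list (taken from the a/an/the example in the question) are assumptions.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module TermFrequencySketch
    ' Illustrative stop-word list (assumption), based on the question's examples.
    Private ReadOnly StopWords As New HashSet(Of String)(
        New String() {"a", "an", "the"}, StringComparer.OrdinalIgnoreCase)

    ' Returns a map from each lower-cased word to its number of occurrences,
    ' skipping the stop words.
    Public Function CountTerms(content As String) As Dictionary(Of String, Integer)
        Dim counts As New Dictionary(Of String, Integer)()
        For Each m As Match In Regex.Matches(content, "\b\w+\b")
            Dim word As String = m.Value.ToLowerInvariant()
            If StopWords.Contains(word) Then Continue For
            Dim current As Integer = 0
            counts.TryGetValue(word, current) ' leaves current at 0 if the word is new
            counts(word) = current + 1
        Next
        Return counts
    End Function
End Module
```

Counting in a single pass avoids running one Regex.Matches call per word of interest, which matters once the vocabulary gets large.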
The simplest way to do this is to read the text file into a string and then use the .NET Framework to look for a match:
Dim text As String = File.ReadAllText("D:\Temp\MyFile.txt")
Dim index As Integer = text.IndexOf("hello")
If index >= 0 Then
' String is in file, starting at character "index"
End If
Or, solution 2: you need a StreamReader and a bit of Regex.
//read file content in StreamReader
StreamReader txtReader = new StreamReader(fName);
string szReadAll = txtReader.ReadToEnd();//Reads the whole text file to the end
txtReader.Close(); //Closes the text file after it is fully read.
txtReader = null;
//search word in file content
if (Regex.IsMatch(szReadAll, "SearchME", RegexOptions.IgnoreCase))//If the match is found in allRead
MessageBox.Show("found");
else
MessageBox.Show("not found");
That's all; I hope this answers your question. Regards
You can try something like this:
Dim text As String = IO.File.ReadAllText("C:\file.txt")
Dim wordsToSearch() As String = New String() {"Hello", "World", "foo"}
Dim words As New List(Of String)()
Dim findings As Dictionary(Of String, List(Of Integer))
'Dividing into words'
words.AddRange(text.Split(New String() {" ", Environment.NewLine()}, StringSplitOptions.RemoveEmptyEntries))
'Discarding all the words you don't want'
words.RemoveAll(New Predicate(Of String)(AddressOf WordsDeleter))
findings = SearchWords(words, wordsToSearch)
Console.WriteLine("Number of 'foo': " & findings("foo").Count)
And the functions used:
Private Function WordsDeleter(ByVal obj As String) As Boolean
Dim wordsToDelete As New List(Of String)(New String() {"a", "an", "the"})
Return wordsToDelete.Contains(obj.ToLower)
End Function
Private Function SearchWords(ByVal allWords As List(Of String), ByVal wordsToSearch() As String) As Dictionary(Of String, List(Of Integer))
    Dim dResult As New Dictionary(Of String, List(Of Integer))()
    For Each s As String In wordsToSearch
        Dim positions As New List(Of Integer)
        dResult.Add(s, positions)
        ' Restart the search from the beginning for every word,
        ' and stop as soon as IndexOf reports no further match (-1)
        Dim i As Integer = allWords.IndexOf(s)
        While i >= 0
            positions.Add(i)
            i = allWords.IndexOf(s, i + 1)
        End While
    Next
    Return dResult
End Function
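None of the snippets above actually computes TF-IDF, which is what the question asks for. Once each file has been reduced to a list of words, one common variant is tf = (occurrences of the term in the document) / (number of words in the document) and idf = ln(N / number of documents containing the term). The sketch below assumes that variant and assumes the word lists are already lower-cased and stop-word filtered; TfIdfSketch and ComputeTfIdf are hypothetical names, and other weighting schemes exist.

```vbnet
Imports System
Imports System.Collections.Generic

Module TfIdfSketch
    ' docs: one word list per document (assumed already normalized).
    ' Returns, per document, a map from term to its TF-IDF weight.
    Public Function ComputeTfIdf(docs As List(Of List(Of String))) As List(Of Dictionary(Of String, Double))
        ' Pass 1: per-document term counts and document frequencies.
        Dim counts As New List(Of Dictionary(Of String, Integer))()
        Dim docFreq As New Dictionary(Of String, Integer)()
        For Each doc As List(Of String) In docs
            Dim c As New Dictionary(Of String, Integer)()
            For Each w As String In doc
                Dim n As Integer = 0
                c.TryGetValue(w, n)
                c(w) = n + 1
            Next
            ' Each distinct term in this document adds 1 to its document frequency.
            For Each term As String In c.Keys
                Dim df As Integer = 0
                docFreq.TryGetValue(term, df)
                docFreq(term) = df + 1
            Next
            counts.Add(c)
        Next

        ' Pass 2: combine term frequency and inverse document frequency.
        Dim result As New List(Of Dictionary(Of String, Double))()
        For i As Integer = 0 To docs.Count - 1
            Dim weights As New Dictionary(Of String, Double)()
            For Each kv As KeyValuePair(Of String, Integer) In counts(i)
                Dim tf As Double = kv.Value / CDbl(docs(i).Count)
                Dim idf As Double = Math.Log(docs.Count / CDbl(docFreq(kv.Key)))
                weights(kv.Key) = tf * idf
            Next
            result.Add(weights)
        Next
        Return result
    End Function
End Module
```

With this definition, a term that appears in every document gets a weight of 0, because ln(N / N) = 0; that is the intended effect of the idf factor, not a bug.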
Go through this [tutorial](http://www.codeproject.com/Questions/302262/How-to-search-specific-string-into-分離文本文件) and tell me whether it helps. – 2013-03-01 05:40:03