Check word frequencies of an input file against a vocabulary (Python)

I want to create a bag-of-words representation of a text file as a vector (.toarray()). I am using this code:

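from sklearn.feature_extraction.text import CountVectorizer

# input="file" makes the vectorizer call .read() on each item it is given
vectorizer = CountVectorizer(input="file")
with open('D:\\test\\45.txt') as f:
    bag_of_words = vectorizer.fit_transform([f])
# fit_transform returns a sparse matrix; .toarray() gives the dense count vector
print(bag_of_words.toarray())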

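I want to compare the file against a vocabulary using CountVectorizer. I have another text file that I have tokenized, and I want to use its tokens as the vocabulary. How can I do that?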

Asked 2015-12-10 by Masyaf

Assuming tokenization is as simple as splitting the text on whitespace, you can build a vocabulary from a text like this:

with open('foo.txt') as f:
    text = f.read()          # text is a single string
tokens = text.split()        # split the string into whitespace-separated tokens
vocab = list(set(tokens))    # set() removes duplicates from the token list
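To count how often each vocabulary word occurs in another text, one option is to pass the vocabulary to CountVectorizer through its `vocabulary` parameter. The following is only a minimal sketch: `vocab_text` and `input_text` are made-up strings standing in for the contents of the two files, and the tokens are lowercased because CountVectorizer lowercases its input by default. Note also that the default token pattern only matches tokens of two or more word characters, so vocabulary entries that are single characters or contain punctuation would never be counted.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the contents of the vocabulary file and the input file.
vocab_text = "apple banana cherry"
input_text = "banana apple banana durian"

# Build the vocabulary by whitespace splitting; sorted() fixes the column order.
vocab = sorted(set(vocab_text.lower().split()))

# With a fixed vocabulary no fitting is needed, and words outside the
# vocabulary (here "durian") are simply ignored.
vectorizer = CountVectorizer(vocabulary=vocab)
counts = vectorizer.transform([input_text]).toarray()[0]

# Pair each vocabulary word with its count in the input text.
frequencies = {word: int(n) for word, n in zip(vocab, counts)}
print(frequencies)  # {'apple': 1, 'banana': 2, 'cherry': 0}
```

To read from files instead, replace the two strings with the result of `f.read()` on each file.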

Answered 2015-12-11 13:32:30

And how do I compare it with another text? – Masyaf

By using Python's set operations. http://www.learnpython.org/en/Sets
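As a rough illustration of that suggestion, the sketch below compares a vocabulary with the tokens of another text using set intersection and difference. The strings are invented examples, not data from the question.

```python
# Hypothetical token sets standing in for the vocabulary file and another text.
vocab = set("apple banana cherry".split())
other = set("banana durian banana apple".split())

shared = vocab & other    # words that appear in both
missing = vocab - other   # vocabulary words absent from the other text
unknown = other - vocab   # words in the other text that are not in the vocabulary

print(sorted(shared))   # ['apple', 'banana']
print(sorted(missing))  # ['cherry']
print(sorted(unknown))  # ['durian']
```

Sets only tell you whether a word is present; they discard how many times it occurs, so use a counter or a vectorizer if you need frequencies.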
