2017-01-17 75 views
1

Creating a vocabulary dictionary for text mining

I have the following code:

train_set = ("The sky is blue.", "The sun is bright.") 
test_set = ("The sun in the sky is bright.", 
    "We can see the shining sun, the bright sun.") 

Now I am trying to compute the term frequencies like this:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

Next I want to print the vocabulary, so I do:

vectorizer.fit_transform(train_set) 
print vectorizer.vocabulary 

Now I get None as the output, although I expected something like:

{'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3} 

Any ideas what is going wrong?

+1

Possible duplicate of [CountVectorizer does not print vocabulary](http://stackoverflow.com/questions/28894756/countvectorizer-does-not-print-vocabulary)

Answers

2

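I think you can try this (note the trailing underscore, which scikit-learn uses for attributes learned during fitting):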

print(vectorizer.vocabulary_)
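
To make the difference between the two attributes concrete, here is a minimal end-to-end sketch. It assumes scikit-learn is installed and uses the default CountVectorizer settings, which are not necessarily the ones behind the expected output in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")

vectorizer = CountVectorizer()
vectorizer.fit_transform(train_set)

# `vocabulary` is the constructor parameter; it stays None unless you pass one in.
print(vectorizer.vocabulary)

# `vocabulary_` is filled in by fitting and maps each term to its column index.
print(vectorizer.vocabulary_)
```

With the defaults, no stop words are removed, so "the" and "is" show up in the mapping as well. Passing stop_words="english" to the constructor should drop them, although the indices are assigned in sorted term order and so will probably not match the ones shown in the question.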
1

CountVectorizer does not support what you are looking for.

You can use the Counter class:

from collections import Counter 

train_set = ("The sky is blue.", "The sun is bright.") 
word_counter = Counter() 
for s in train_set: 
    word_counter.update(s.split()) 

print(word_counter) 

which gives:

Counter({'is': 2, 'The': 2, 'blue.': 1, 'bright.': 1, 'sky': 1, 'sun': 1}) 
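
Note that plain split() keeps punctuation and case, which is why 'blue.' and 'The' appear as keys above. If that is not wanted, one possible variation (a sketch only; the regex and the lowercasing are my own assumptions about what counts as a word) is:

```python
import re
from collections import Counter

train_set = ("The sky is blue.", "The sun is bright.")

word_counter = Counter()
for sentence in train_set:
    # Lowercase, then keep only runs of word characters so "blue." becomes "blue".
    word_counter.update(re.findall(r"\w+", sentence.lower()))

# most_common() lists (word, count) pairs from most to least frequent.
print(word_counter.most_common())
```

Here "the" and "is" are each counted twice and every other word once, with no trailing periods in the keys.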

Or you can use FreqDist from NLTK:

from nltk import FreqDist 

train_set = ("The sky is blue.", "The sun is bright.") 
word_dist = FreqDist() 
for s in train_set: 
    word_dist.update(s.split()) 

print(dict(word_dist)) 

which gives:

{'blue.': 1, 'bright.': 1, 'is': 2, 'sky': 1, 'sun': 1, 'The': 2}
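
If you would rather stay within scikit-learn, the count matrix returned by fit_transform can probably be reduced to per-term totals as well. The following is only a sketch under default CountVectorizer settings; the variable names totals and term_freq are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train_set)  # sparse matrix, one row per document

# Sum each column over all documents to get one total per term.
totals = np.asarray(counts.sum(axis=0)).ravel()

# vocabulary_ maps term -> column index, so use it to label the totals.
term_freq = {term: int(totals[idx]) for term, idx in vectorizer.vocabulary_.items()}
print(term_freq)
```

Unlike the split()-based versions, this relies on CountVectorizer's own tokenizer, which lowercases the text and ignores punctuation by default.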