sklearn CountVectorizer

I have a question about using vocabulary_.get; the code is below. In a machine learning exercise, I used CountVectorizer to count the occurrences of specific words.

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
s1 = 'KJ YOU WILL BE FINE'
s2 = 'ABHI IS MY BESTIE'
s3 = 'sam is my bestie'
frnd_list = [s1, s2, s3]
# Learn the vocabulary, then build the document-term matrix
vectorizer.fit(frnd_list)
bag_of_words = vectorizer.transform(frnd_list)
print('Bag_of_words is :')
print(bag_of_words)
# Look up the feature number (column index) for a word, e.g.:
print("'bestie' has feature number:")
print(vectorizer.vocabulary_.get('bestie'))
print("'BESTIE' has feature number:")
print(vectorizer.vocabulary_.get('BESTIE'))

OUTPUT:

Bag_of_words is : 
(0, 1) 1 
(0, 3) 1 
(0, 5) 1 
(0, 8) 1 
(0, 9) 1 
(1, 0) 1 
(1, 2) 1 
(1, 4) 1 
(1, 6) 1 
(2, 2) 1 
(2, 4) 1 
(2, 6) 1 
(2, 7) 1 

'bestie' has feature number: 
2 
'BESTIE' has feature number: 
None

So my question is: why does 'bestie' show the correct feature number, 2, while 'BESTIE' shows None? Does vocabulary_.get not work with capital letters?

Source

2017-10-09 Kinjal Kachi

CountVectorizer takes a parameter lowercase, which defaults to True, as stated in the documentation:

lowercase : boolean, True by default 
    Convert all characters to lowercase before tokenizing.
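A minimal sketch of what that default means for the lookup, reusing the sentences from the question (the names docs, vocab, and all_lower are only illustrative). Since the text is lowercased before tokenizing, every key stored in vocabulary_ is lowercase, and an uppercase key finds nothing unless it is lowercased first:

```python
from sklearn.feature_extraction.text import CountVectorizer

# The three sentences from the question.
docs = ['KJ YOU WILL BE FINE', 'ABHI IS MY BESTIE', 'sam is my bestie']

# Default settings: lowercase=True, so text is lowercased before tokenizing.
vectorizer = CountVectorizer()
vectorizer.fit(docs)

# vocabulary_ is a plain dict mapping each token to its column index.
vocab = vectorizer.vocabulary_

# Every key in the fitted vocabulary is lowercase.
all_lower = all(word == word.lower() for word in vocab)
print(all_lower)                    # True

# A dict lookup is case-sensitive, so the key must match the stored token.
print(vocab.get('BESTIE'))          # None
print(vocab.get('BESTIE'.lower()))  # 2
```

So if the default is kept, one option is simply to lowercase the word before looking it up.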

Change it to False if you want lowercase and uppercase letters to be treated differently.
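As a rough sketch of that change, again using the sentences from the question (cs_vectorizer, cs_vocab, upper_index, and lower_index are just illustrative names): with lowercase=False the tokens keep their original case, so 'BESTIE' and 'bestie' become two separate vocabulary entries.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The three sentences from the question.
docs = ['KJ YOU WILL BE FINE', 'ABHI IS MY BESTIE', 'sam is my bestie']

# Keep the original case: 'BESTIE' and 'bestie' are now different tokens.
cs_vectorizer = CountVectorizer(lowercase=False)
cs_vectorizer.fit(docs)

cs_vocab = cs_vectorizer.vocabulary_
upper_index = cs_vocab.get('BESTIE')
lower_index = cs_vocab.get('bestie')

# Both lookups now succeed, and each word has its own feature number.
print(upper_index, lower_index)
```

Note that this also splits the counts: 'IS MY BESTIE' and 'is my bestie' no longer share any features, which may or may not be what you want.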

Source

2017-10-09 17:00:47 MedAli
