CountVectorizer has words that are not in the data

I am new to sklearn and CountVectorizer.

I am seeing some strange behavior.

Initializing the vectorizer:

from sklearn.feature_extraction.text import CountVectorizer 
count_vect = CountVectorizer() 
document_mtrx = count_vect.fit_transform(df['description']) 
count_vect.vocabulary_ 
Out[28]: 
{u'viewscity': 36216, 
u'sizeexposed': 31584, 
u'rentalcontact': 29104, 
u'villagebldg': 36323,

Fetching the rows that contain rentalcontact:

df[df['description'].str.contains('rentalcontact')]

The number of rows returned is 0. Why does this happen?

Source

2017-03-11 aceminer

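CountVectorizer has a parameter lowercase that defaults to True. Most likely that is why you cannot find these values: the vocabulary holds lowercased tokens, while str.contains is case-sensitive.

So try this: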

df[df['description'].str.lower().str.contains('rentalcontact')] 
#      ^^^^^^^
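To make the effect of lowercase=True concrete, here is a minimal sketch on toy data. The DataFrame contents are hypothetical stand-ins for df['description'], not the asker's data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for df['description'].
df = pd.DataFrame({"description": ["Great RentalContact today", "city views"]})

count_vect = CountVectorizer()  # lowercase=True by default
count_vect.fit(df["description"])

# The vocabulary stores the lowercased token ...
in_vocab = "rentalcontact" in count_vect.vocabulary_

# ... while a case-sensitive search on the raw text misses the mixed-case original.
case_sensitive_hits = df[df["description"].str.contains("rentalcontact")]

# Passing case=False is an alternative to calling .str.lower() first.
case_insensitive_hits = df[df["description"].str.contains("rentalcontact", case=False)]

print(in_vocab, len(case_sensitive_hits), len(case_insensitive_hits))
```

If the case-insensitive search still returns nothing on the real data, the token probably comes from somewhere else (for example, punctuation being stripped during tokenization), and this sketch would not explain it.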

UPDATE：

vocabulary_ : dict

A mapping of terms to feature indices.

u'rentalcontact': 29104 means that 'rentalcontact' has index 29104 in the feature list.

I.e. vectorizer.get_feature_names()[29104] should return 'rentalcontact'

Source
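A small sketch of that index lookup, using hypothetical documents. Newer scikit-learn releases expose get_feature_names_out() in place of get_feature_names(), so the sketch inverts vocabulary_ directly to stay independent of the installed version.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, for illustration only.
docs = ["rentalcontact for city views", "city views and more views"]

vectorizer = CountVectorizer()
vectorizer.fit(docs)

# vocabulary_ maps term -> column index; it does not hold counts.
idx = vectorizer.vocabulary_["rentalcontact"]

# Invert the mapping to go from column index back to the term.
index_to_term = {i: term for term, i in vectorizer.vocabulary_.items()}

print(idx, index_to_term[idx])
```

The index depends only on the term's position in the alphabetically sorted vocabulary, so it says nothing about how often the term occurs.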

2017-03-11 10:00:32 MaxU

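So the number of times this term occurs is 29104. However, when I ran my last line of code, it returned only 1 result. Is there something I am missing? – aceminer

@aceminer, AFAIK 29104 is the index of 'rentalcontact' in the sorted feature list. How to check: print(vectorizer.get_feature_names()[29104]) – MaxU

How do I get the frequency of a term? – aceminer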
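The thread does not answer this last question. One possible approach, sketched here on hypothetical documents, is to use the term's index from vocabulary_ to pick its column in the document-term matrix and sum that column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, for illustration only.
docs = [
    "views views city",
    "city rentalcontact rentalcontact",
    "RentalContact views",
]

count_vect = CountVectorizer()
document_mtrx = count_vect.fit_transform(docs)  # sparse document-term matrix

# Column of the term in the matrix.
col = count_vect.vocabulary_["rentalcontact"]

# Total occurrences across all documents: sum of that column.
total_occurrences = int(document_mtrx[:, col].sum())

# Number of documents containing the term: non-zero entries in that column.
docs_containing = int(document_mtrx[:, col].count_nonzero())

print(total_occurrences, docs_containing)
```

Summing over axis 0 of the whole matrix would give the counts for every term at once, which is cheaper when frequencies are needed for many terms.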
