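Attribute error when using scikit-learn

I am trying to use scikit-learn to find similar questions using cosine similarity. I am trying out this sample code that is available on the internet (Link1 and Link2).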
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA
train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright."]
stopWords = stopwords.words('english')
vectorizer = CountVectorizer(stop_words = stopWords)
transformer = TfidfTransformer()
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray
cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)
for vector in trainVectorizerArray:
    print vector
    for testV in testVectorizerArray:
        print testV
        cosine = cx(vector, testV)
        print cosine
transformer.fit(trainVectorizerArray)
print transformer.transform(trainVectorizerArray).toarray()
transformer.fit(testVectorizerArray)
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()
I always get this error:
Traceback (most recent call last):
File "C:\Users\Animesh\Desktop\NLP\ngrams2.py", line 14, in <module>
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
File "C:\Python27\lib\site-packages\scikit_learn-0.13.1-py2.7-win32.egg\sklearn \feature_extraction\text.py", line 740, in fit_transform
raise ValueError("empty vocabulary; training set may have"
ValueError: empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low).
I also checked the code at this link. There I got the error AttributeError: 'CountVectorizer' object has no attribute 'vocabulary'.
How can I solve this problem?
I am using Python 2.7.3 on Windows 7 32-bit with scikit_learn 0.13.1.
Oh! That solved the problem. But can you tell me what `vocabulary` does, and why it gives an attribute error when I try to use it? – 2013-03-05 10:33:08
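@AnimeshPandey: The error message is accurate: "empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low)." As I explained, the default setting `min_df=2` is too high because you only have two documents. (Note that tf-idf will not work properly with so few documents.) – 2013-03-05 10:33:45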
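To illustrate why a document-frequency threshold of 2 leaves nothing behind here, the following is a minimal pure-Python sketch, not scikit-learn's actual implementation. The tokenizer only approximates `CountVectorizer`'s default token pattern, and `stop_words`, `tokenize`, `document_frequency` and `vocabulary` are hypothetical names introduced for this example (the stop list is a tiny stand-in for NLTK's English list).

```python
import re
from collections import Counter

train_set = ["The sky is blue.", "The sun is bright."]

# Assumption: a tiny stand-in for NLTK's English stop word list.
stop_words = {"the", "is", "in"}


def tokenize(text):
    """Lowercase, keep tokens of two or more word characters, drop stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in stop_words]


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def vocabulary(docs, min_df):
    """Keep only the terms that occur in at least min_df documents."""
    df = document_frequency(docs)
    return sorted(term for term, count in df.items() if count >= min_df)


df = document_frequency(train_set)
vocab_min_df_2 = vocabulary(train_set, min_df=2)
vocab_min_df_1 = vocabulary(train_set, min_df=1)

print(dict(df))        # every remaining term occurs in exactly one document
print(vocab_min_df_2)  # empty: no term reaches a document frequency of 2
print(vocab_min_df_1)  # all four content words survive
```

With real scikit-learn the corresponding change would be to pass `min_df=1` to `CountVectorizer` explicitly.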
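The `vocabulary_` attribute (with a trailing `_`) is extracted when the fit method is called (unless the user provided it as a constructor parameter). See the [documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction). – ogrisel 2013-03-05 10:34:00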
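As a small sketch of the point about the trailing underscore: the example below assumes Python 3 and a recent scikit-learn release rather than the 0.13.1 setup from the question, and it uses scikit-learn's built-in `'english'` stop list in place of NLTK's so that it stays self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ["The sky is blue.", "The sun is bright."]

# min_df=1 is set explicitly so that terms seen in a single document are kept.
vectorizer = CountVectorizer(stop_words="english", min_df=1)

# The learned mapping does not exist until the vectorizer has been fitted.
fitted_before = hasattr(vectorizer, "vocabulary_")

vectorizer.fit(train_set)

# After fitting, vocabulary_ maps each term to its column index.
fitted_after = hasattr(vectorizer, "vocabulary_")
terms = sorted(vectorizer.vocabulary_)

print(fitted_before, fitted_after)
print(terms)
```

Reading `vectorizer.vocabulary_` only after `fit` or `fit_transform` avoids the AttributeError.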