2016-07-02 12 views
0

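KNN with TF-IDF throws a "Reshape your data" warning when cosine similarity is the distance metric

I am trying to use KNN in scikit-learn with cosine similarity, but it keeps throwing these warnings. Can someone explain what they mean, and why they appear only when I fit the KNN model with cosine similarity and not with any other distance metric?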

Code:

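import time
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

t0 = time.time() 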
count_vect = CountVectorizer() 
X_train_counts = count_vect.fit_transform(X) 

tfidf_transformer = TfidfTransformer() 
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts) 

vectorizer = TfidfVectorizer() 
vec_fit = vectorizer.fit_transform(X) 

t1 = time.time() 
total = t1-t0 
print "TF-IDF built:", total 

#######################------------------------############################ 

t0 = time.time() 
nbrs = NearestNeighbors(n_neighbors=20, algorithm='auto', metric=cosine_similarity) 
nbrs.fit(X_train_tfidf.toarray())#,Y) 
# KD_TREE won't work here because it doesn't work with sparse matrices; given a dense matrix, it throws a memory error

t1 = time.time() 
total = t1-t0 
print "KNN Built:", total 

The warning message, repeated over and over:

C:\Anaconda2\lib\site-packages\sklearn\utils\validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
    DeprecationWarning)

Following a suggestion, I tried this:

nbrs = NearestNeighbors(n_neighbors=20, algorithm='auto', metric=cosine_similarity) 
nbrs.fit(numpy.array(X_train_tfidf).reshape(1, -1)) 

This raises the following error:

Traceback (most recent call last): 
    File ".\tf-idf.py", line 54, in <module> 
    nbrs.fit(numpy.array(X_train_tfidf).reshape(1, -1)) 
    File "C:\Miniconda2\lib\site-packages\sklearn\neighbors\base.py", line 816, in fit 
    return self._fit(X) 
    File "C:\Miniconda2\lib\site-packages\sklearn\neighbors\base.py", line 221, in _fit 
    X = check_array(X, accept_sparse='csr') 
    File "C:\Miniconda2\lib\site-packages\sklearn\utils\validation.py", line 373, in check_array 
    array = np.array(array, dtype=dtype, order=order, copy=copy) 
ValueError: setting an array element with a sequence. 

Answer

0

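It makes no sense to me that this does not show up with other metrics (such as linear_kernel). My guess is that it is something they forgot (?) to update, since both (linear_kernel, cosine_similarity) are kernel operations.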

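As for fixing it: you get this warning because the fit() method expects a 2-dimensional array, but a 1-dimensional array is being passed. For example, X_train_tfidf=np.array([1,2,3,4.234,213.2]) triggers the warning because it has shape (5,). By contrast, X_train_tfidf=np.array([[1,2,3,4.234,213.2]]) does not, because it has shape (1, 5) and is therefore 2-dimensional.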
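To make the shapes concrete, here is a small NumPy-only sketch. The variable names (flat, row, as_sample, as_feature) are mine, chosen for illustration; the two reshape calls are the ones named in the warning text.

```python
import numpy as np

# 1-D array: five numbers, no notion of rows or columns yet
flat = np.array([1, 2, 3, 4.234, 213.2])

# 2-D array written with double brackets: one row, five columns
row = np.array([[1, 2, 3, 4.234, 213.2]])

# The two reshapes suggested by the warning message
as_sample = flat.reshape(1, -1)   # one sample with five features
as_feature = flat.reshape(-1, 1)  # five samples with one feature each

print(flat.shape)        # (5,)
print(row.shape)         # (1, 5)
print(as_sample.shape)   # (1, 5)
print(as_feature.shape)  # (5, 1)
```

Which reshape is right depends on what the vector represents. A single TF-IDF document vector is one sample, so reshape(1, -1) is the one that applies there.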

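What the warning message tells you to do is take your 1-dimensional array and convert it to a 2-dimensional one, for example X_train_tfidf=np.array([1,2,3,4.234,213.2]).reshape(1, -1), which is equivalent to X_train_tfidf=np.array([[1,2,3,4.234,213.2]]).

Kernels are essentially products of linear algebra and involve matrix operations, which are 2-dimensional by default.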
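One place where the reshape can actually be applied is inside the metric itself. As far as I know, NearestNeighbors hands a callable metric two 1-D row vectors at a time, while cosine_similarity expects 2-D inputs, which would explain why the warning appears only with this metric. The sketch below is a guess at a workaround, not a confirmed fix: cosine_distance_1d is a hypothetical helper of mine (not part of scikit-learn), the toy matrix is made up, and I use algorithm='brute' on the assumption that a tree-based index is not a safe fit for cosine distance. Note also that cosine_similarity returns a similarity, so it is turned into a distance here.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors


def cosine_distance_1d(u, v):
    """Hypothetical helper: cosine distance between two 1-D vectors."""
    # Reshape each 1-D vector to a single-sample 2-D array, as the warning asks
    sim = cosine_similarity(u.reshape(1, -1), v.reshape(1, -1))
    # Convert similarity to distance; clip tiny negative values from rounding
    return max(0.0, 1.0 - sim[0, 0])


# Toy dense data (assumed for illustration): four samples, three features
X_dense = np.array([[1.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])

nbrs = NearestNeighbors(n_neighbors=2, algorithm='brute',
                        metric=cosine_distance_1d)
nbrs.fit(X_dense)

# The query must be 2-D as well: X_dense[:1] keeps the row dimension
distances, indices = nbrs.kneighbors(X_dense[:1])
print(indices)
print(distances)
```

A Python callable is invoked once per pair of rows, so this is likely to be slow on a real TF-IDF matrix. It is meant to show where the 1-D arrays come from rather than to be the approach to use at scale.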

Hope that makes sense. If not, please shout.

+0

TF-IDF is a sparse matrix, so I don't really know how to handle it. And numpy.array().reshape(1,-1) does not work. Edited the post. – user3667569

+0

Try this: 'X_train_tfidf.toarray().reshape(1,-1)' – kazAnova

+0

Tried it. Same problem. – user3667569
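Since reshaping the whole TF-IDF matrix did not help in this thread, a different route may be worth trying: leave the matrix sparse and pass the built-in string metric 'cosine' with algorithm='brute', so no Python callable and no reshape is involved. This is a minimal sketch under my own assumptions, not something confirmed by the posters; the four toy documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus (assumed): two documents about cats, two about markets
docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
    "markets and stocks fell today",
]

# TfidfVectorizer returns a scipy sparse matrix: one row per document
X_tfidf = TfidfVectorizer().fit_transform(docs)

# 'cosine' is a built-in distance for brute-force search and accepts
# sparse input, so there is no need for .toarray() or reshape here
nbrs = NearestNeighbors(n_neighbors=2, algorithm='brute', metric='cosine')
nbrs.fit(X_tfidf)

# Slicing a sparse matrix with a single row index keeps it 2-D (1 x n_terms)
distances, indices = nbrs.kneighbors(X_tfidf[0])
print(indices)
print(distances)
```

Brute-force search compares the query against every row, which is usually acceptable for sparse text data and avoids building a dense copy of the matrix.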