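TfidfVectorizer splits words into individual characters?

I am trying to find the nearest neighbors within a set of descriptions. Each description typically contains 1-15 words, which I tokenize with scikit-learn's TfidfVectorizer. I then vectorize the base description with the same vectorizer. However, the vectorizer seems to split that single description into individual characters rather than words, because the resulting sparse matrix has shape [number of letters in the base description x number of unique words in the corpus].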
description = 'total assets'
products = LoadData('C:/dict.csv', dtype = {'Code': np.str, 'LocalLanguageLabel': np.str})
products = products.fillna({'LocalLanguageLabel':''})
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(token_pattern=r'\b\w+\b')
#tried the below two as well
#vectorizer = TfidfVectorizer()
#vectorizer = TfidfVectorizer(token_pattern=r'\b\w+\b', analyzer = 'word')
dict_matrix = vectorizer.fit_transform(products['LocalLanguageLabel'])
input_matrix = vectorizer.transform(description)
from sklearn.neighbors import NearestNeighbors
model = NearestNeighbors(metric='euclidean', algorithm='brute')
model.fit(dict_matrix)
distance, indices = model.kneighbors(input_matrix,n_neighbors = 10)
When I print input_matrix, this is what I get (as you can guess, the indices correspond to the characters of 'totalassets'):
print(input_matrix)
(0, 33478) 1.0 #t
(1, 24021) 1.0 #o
(2, 33478) 1.0 #t
(3, 2298) 1.0 #a
(4, 20272) 1.0 #l
(6, 2298) 1.0 #a
(7, 30874) 1.0 #s
(8, 30874) 1.0 #s
(9, 11386) 1.0 #e
(10, 33478) 1.0 #t
(11, 30874) 1.0 #s
<12x39859 sparse matrix of type '<class 'numpy.float64'>'
with 11 stored elements in Compressed Sparse Row format>
This is not what I expected: I expected 10 distances and 10 indices, but instead I get 12 lists of 10 elements each.
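One possible explanation, offered as a sketch rather than a confirmed diagnosis: transform expects an iterable of documents, and iterating over a Python string yields its characters, so each character would be treated as a separate document. The snippet below reproduces the 12-row shape by passing list(description) explicitly, then shows the one-row result obtained by wrapping the string in a list. The corpus here is a hypothetical stand-in for products['LocalLanguageLabel'], and n_neighbors is reduced to 3 only because that stand-in corpus is tiny. I believe recent scikit-learn releases reject a bare string with a ValueError instead of iterating over it, so the exact behavior may depend on the installed version.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in corpus; the real labels come from products['LocalLanguageLabel'].
corpus = [
    "total assets",
    "total liabilities",
    "net assets",
    "fixed assets",
    "current liabilities",
]
description = "total assets"

vectorizer = TfidfVectorizer(token_pattern=r"\b\w+\b")
dict_matrix = vectorizer.fit_transform(corpus)

# list(description) mimics what iterating over a bare string produces:
# one "document" per character, including the space.
per_char = vectorizer.transform(list(description))

# Wrapping the string in a list passes it as a single document.
per_doc = vectorizer.transform([description])

print(per_char.shape)  # 12 rows, one per character
print(per_doc.shape)   # 1 row, one per description

model = NearestNeighbors(metric="euclidean", algorithm="brute")
model.fit(dict_matrix)

# With a single query row, kneighbors returns one row of distances and indices.
distance, indices = model.kneighbors(per_doc, n_neighbors=3)
print(distance.shape, indices.shape)
```

If this is indeed the cause, changing the call in the question to vectorizer.transform([description]) should yield a single row, and kneighbors should then return one list of 10 distances and one list of 10 indices.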