2017-07-24

Feature weights in k-means

I have a set of Wikipedia texts that I want to cluster.

The code is as follows:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# parameters
maximum_features = 1000000
max_iterations = 300

# load text file
wiki = pd.read_csv('people_wiki.csv')

# TF-IDF vectorization
vectorizer = TfidfVectorizer(max_features=maximum_features, norm='l2', stop_words='english')
tfidf = vectorizer.fit_transform(wiki['text'])

# clustering
kmeans = KMeans(n_clusters=3, random_state=0, init='k-means++', max_iter=max_iterations).fit(tfidf)

I would like to know the weight of each feature, in a form like this: she: 0.025, her: 0.017, ...

In summary: I want the weight of each feature (word), and to display the 5 most relevant ones.

The file 'people_wiki.csv' is available here:

https://ufile.io/udg1y

Answer

Try this solution:

# idf_ is an attribute of the fitted vectorizer, not of the transformed matrix
print(vectorizer.idf_)