Suppose I have some text sentences that I want to cluster with k-means. How do I transform new data into the PCA components of my training data?
sentences = [
"fix grammatical or spelling errors",
"clarify meaning without changing it",
"correct minor mistakes",
"add related resources or links",
"always respect the original author"
]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(sentences)
num_clusters = 2
km = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)
km.fit(X)
Now I can predict which cluster a new piece of text will fall into:
new_text = "hello world"
vec = vectorizer.transform([new_text])
print(km.predict(vec)[0])
However, say I apply PCA to reduce the 10,000 features to 50:
from sklearn.decomposition import RandomizedPCA
pca = RandomizedPCA(n_components=50, whiten=True)
X2 = pca.fit_transform(X)
km.fit(X2)
Because the vectorizer's output no longer matches the reduced features KMeans was fit on, I can no longer do the same thing to predict the cluster of a new text:
new_text = "hello world"
vec = vectorizer.transform([new_text])
print(km.predict(vec)[0])
ValueError: Incorrect number of features. Got 10000 features, expected 50
So how do I transform the new text into the lower-dimensional feature space?
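A minimal sketch of the pipeline I think I'm after, assuming the fix is to push new text through the same fitted reducer via transform (never fit_transform). RandomizedPCA has since been removed from scikit-learn, so the sketch uses TruncatedSVD, which accepts the sparse count matrix directly and plays the same role here; n_components is kept at 3 only because the toy corpus has five sentences:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

sentences = [
    "fix grammatical or spelling errors",
    "clarify meaning without changing it",
    "correct minor mistakes",
    "add related resources or links",
    "always respect the original author",
]

# Fit the vectorizer and the reducer on the training text only.
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(sentences)     # sparse counts, n_samples x n_terms
svd = TruncatedSVD(n_components=3)          # stand-in for RandomizedPCA; works on sparse input
X_reduced = svd.fit_transform(X)            # n_samples x 3
km = KMeans(n_clusters=2, init='random', n_init=1)
km.fit(X_reduced)

# New text must go through the *same* fitted objects: transform, not fit_transform.
new_text = "hello world"
vec = vectorizer.transform([new_text])      # still n_terms-dimensional
vec_reduced = svd.transform(vec)            # now 3-dimensional, matches what km was fit on
print(km.predict(vec_reduced)[0])

If that is the right idea, wrapping the vectorizer, reducer, and KMeans in a sklearn.pipeline.Pipeline would keep the fit and predict paths consistent automatically.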