2014-10-03 67 views

Suppose I have some text sentences that I want to cluster with k-means. How do I transform new data into the PCA components of my training data?

sentences = [ 
    "fix grammatical or spelling errors", 
    "clarify meaning without changing it", 
    "correct minor mistakes", 
    "add related resources or links", 
    "always respect the original author" 
] 

from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.cluster import KMeans 

vectorizer = CountVectorizer(min_df=1) 
X = vectorizer.fit_transform(sentences) 
num_clusters = 2 
km = KMeans(n_clusters=num_clusters, init='random', n_init=1,verbose=1) 
km.fit(X) 

Now I can predict which cluster a new text falls into:

new_text = "hello world" 
vec = vectorizer.transform([new_text]) 
print(km.predict(vec)[0])

However, say I apply PCA to reduce the 10,000 features to 50:

from sklearn.decomposition import RandomizedPCA 

pca = RandomizedPCA(n_components=50,whiten=True) 
X2 = pca.fit_transform(X) 
km.fit(X2) 

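Since the vectorizer's output no longer matches what the model was trained on, I can't do the same thing to predict the cluster of a new text: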

new_text = "hello world" 
vec = vectorizer.transform([new_text])  # still in the original 10,000-feature space
print(km.predict(vec)[0])
ValueError: Incorrect number of features. Got 10000 features, expected 50 

So how do I transform new text into the lower-dimensional feature space?

Answers

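You want to call pca.transform on the new data before feeding it to the model. This performs dimensionality reduction with the same PCA model that was fitted when you ran pca.fit_transform on the original data. You can then use your fitted KMeans model to predict on the reduced data.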

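Basically, you can think of it as one large model made up of three smaller ones. First, you have a CountVectorizer model that determines how to process the raw text. Then you run a RandomizedPCA model that performs the dimensionality reduction. Finally, you run a KMeans model for clustering. When you fit the models, you go down this stack and fit each one in turn. When you want to predict, you also have to go down the stack and apply each one.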

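from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import RandomizedPCA
from sklearn.cluster import KMeans

# Initialize models (`sentences` is the list from the question)
vectorizer = CountVectorizer(min_df=1) 
pca = RandomizedPCA(n_components=50, whiten=True) 
km = KMeans(n_clusters=2, init='random', n_init=1, verbose=1) 

# Fit models: each step is fitted on the output of the previous one
X = vectorizer.fit_transform(sentences) 
X2 = pca.fit_transform(X) 
km.fit(X2) 

# Predict with models: only transform here, never refit on new data
X_new = vectorizer.transform(["hello world"]) 
X2_new = pca.transform(X_new) 
km.predict(X2_new) 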
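
For readers on a newer scikit-learn, where RandomizedPCA has been removed, here is a minimal self-contained sketch of the same three-step stack. It assumes TruncatedSVD as a stand-in for RandomizedPCA (it accepts the sparse matrix that CountVectorizer produces) and uses n_components=3 only because the toy vocabulary is tiny. The random_state values, n_init=10, and the sample text are illustrative choices, not part of the original code.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "fix grammatical or spelling errors",
    "clarify meaning without changing it",
    "correct minor mistakes",
    "add related resources or links",
    "always respect the original author",
]

# Assumption: TruncatedSVD replaces RandomizedPCA; 3 components suit this toy data.
vectorizer = CountVectorizer()
svd = TruncatedSVD(n_components=3, random_state=0)
km = KMeans(n_clusters=2, n_init=10, random_state=0)

# Fit: each step learns from the output of the previous step.
X = vectorizer.fit_transform(sentences)   # sparse matrix, one column per vocabulary word
X2 = svd.fit_transform(X)                 # dense matrix with 3 columns
km.fit(X2)

# Predict: reuse the fitted steps with transform(), never fit_transform().
X_new = vectorizer.transform(["fix spelling mistakes"])
X2_new = svd.transform(X_new)             # same 3-dimensional space KMeans was trained in
label = km.predict(X2_new)[0]
print(X.shape[1], "features reduced to", X2_new.shape[1], "-> cluster", label)
```

The point to take away is that the new text passes through the already-fitted vectorizer and reducer, so it ends up with the same number of columns that KMeans saw during training.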

Use a Pipeline:

>>> from sklearn.cluster import KMeans 
>>> from sklearn.decomposition import TruncatedSVD 
>>> from sklearn.feature_extraction.text import CountVectorizer 
>>> from sklearn.pipeline import make_pipeline 
>>> sentences = [ 
...  "fix grammatical or spelling errors", 
...  "clarify meaning without changing it", 
...  "correct minor mistakes", 
...  "add related resources or links", 
...  "always respect the original author" 
... ] 
>>> vectorizer = CountVectorizer(min_df=1) 
>>> svd = TruncatedSVD(n_components=5) 
>>> km = KMeans(n_clusters=2, init='random', n_init=1) 
>>> pipe = make_pipeline(vectorizer, svd, km) 
>>> pipe.fit(sentences) 
Pipeline(steps=[('countvectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict', 
     dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', 
     lowercase=True, max_df=1.0, max_features=None, min_df=1, 
     ngram_range=(1, 1), preprocessor=None, stop_words=None,...n_init=1, 
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, 
    verbose=1))]) 
>>> pipe.predict(["hello, world"]) 
array([0], dtype=int32) 

(TruncatedSVD is shown because RandomizedPCA will stop working on text frequency matrices in an upcoming release; it actually performs an SVD, not a full PCA, anyway.)
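
To see that the pipeline is just the manual "walk down the stack" bundled into one object, the following sketch compares pipe.predict against applying the fitted steps by hand. It assumes a reasonably recent scikit-learn; the step names come from make_pipeline, which names each step after its lowercased class name. The random_state values, n_init=10, n_components=3, and the second sample text are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

sentences = [
    "fix grammatical or spelling errors",
    "clarify meaning without changing it",
    "correct minor mistakes",
    "add related resources or links",
    "always respect the original author",
]

pipe = make_pipeline(
    CountVectorizer(),
    TruncatedSVD(n_components=3, random_state=0),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
pipe.fit(sentences)

new_texts = ["hello, world", "correct spelling errors"]

# Pull the fitted steps back out of the pipeline by name.
vec = pipe.named_steps["countvectorizer"]
svd = pipe.named_steps["truncatedsvd"]
km = pipe.named_steps["kmeans"]

# Manual route: transform with each fitted step, then predict.
manual = km.predict(svd.transform(vec.transform(new_texts)))
# Pipeline route: one call does the same chain internally.
piped = pipe.predict(new_texts)

same = np.array_equal(manual, piped)
print(piped, same)
```

Both routes should agree, since Pipeline.predict calls transform on every intermediate step and predict on the last one.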