PCA and K-means for word clustering

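I have a corpus of Wikipedia articles. I identified the 10,000 most frequent words, looked up their Word2Vec vectors, and ran spherical k-means on those vectors to group the words into 500 clusters by similarity of meaning.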

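I picked 3 of the clusters and converted their words back into word vectors. Each word vector is an array of 300 values, so I applied PCA (from sklearn) to all of them to reduce them to 2D. Then I plotted them: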

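Each point represents a word, and each color represents one cluster. The problem is that these clusters should not overlap. One cluster contains computer-related words, another contains race-related words, and the last contains relationship-related words. I added the word "chicken" to the cluster of computer words, but when plotted, its point lands right next to the point for "keyboard".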

I'm not sure what is going wrong here. Is there something wrong with my approach? Here is my PCA code:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2) #reduce the 300-dimensional vectors to 2D

for words in theList: #theList is an array of my 3 clusters
    lexicalUnitVectors = load_bin_vec("GoogleNews-vectors-negative300.bin", words) #convert words to Word2Vec vectors
    lexicalUnitVectors = list(lexicalUnitVectors.values())
    lexicalUnitVectors = pca.fit(lexicalUnitVectors).transform(lexicalUnitVectors) #apply pca
    print(lexicalUnitVectors) #this shows a bunch of 2D points; all x and y values are close to 0 for some reason
    xs = [i*1 for i in lexicalUnitVectors[:, 0]] #ignore this
    ys = [i*1 for i in lexicalUnitVectors[:, 1]] #ignore this
    plt.scatter(xs, ys, marker = 'o')
    plt.show()


2017-09-17 dvn

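1) In general, I think you should apply PCA before clustering. That is one of the main points of PCA: reducing dimensionality so that you can concentrate on the distinctive aspects of the data.
2) I'm not sure I agree with your assumption that the first two eigenvectors must separate the clusters. If you are reducing dimensionality from word vectors, there are probably many eigenvectors that matter for each cluster. How many eigenvectors are you keeping? Typically you keep only as many as are needed to explain 90% of the variance in the data, but you should look into this carefully.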
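The two points above could be sketched roughly as follows. This is only an illustration under stated assumptions: the vectors are synthetic stand-ins for real Word2Vec vectors, plain `KMeans` from scikit-learn is used in place of spherical k-means, and the cluster count of 3 and the 90% threshold are example values rather than recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Assumption: synthetic stand-in for Word2Vec vectors
# (3 groups of 50 points in 300 dimensions).
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 300))
vectors = np.vstack([c + 0.3 * rng.normal(size=(50, 300)) for c in centers])

# A float n_components between 0 and 1 keeps the smallest number of
# components whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.90, svd_solver="full")
reduced = pca.fit_transform(vectors)

print("components kept:", reduced.shape[1])
print("variance explained by the first two:",
      pca.explained_variance_ratio_[:2].sum())

# Cluster in the reduced space (plain k-means here, not spherical k-means).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)
```

Comparing the number of components kept with the share of variance explained by the first two gives a rough sense of how much structure a 2D plot leaves out.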


2017-09-17 17:12:20 flyingmeatball
