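Visualizing topics with Spark LDA

I am using the pySpark ML LDA library to fit a topic model on the 20 newsgroups dataset from sklearn. I apply standard tokenization, stop-word removal and a tf-idf transformation to the training corpus. In the end, I can get the topics and print out the term indices and their weights: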

topics = model.describeTopics() 
topics.show() 
+-----+--------------------+--------------------+
|topic|         termIndices|         termWeights|
+-----+--------------------+--------------------+
|    0|[5456, 6894, 7878...|[0.03716766297248...|
|    1|[5179, 3810, 1545...|[0.12236370744240...|
|    2|[5653, 4248, 3655...|[1.90742686393836...|
...

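However, how do I map the term indices back to actual words so that I can visualize the topics? I derive the term indices with a HashingTF applied to the tokenized list of strings. How do I generate a dictionary (a mapping from index to word) for visualizing the topics?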

Source

2017-05-29 Vadim Smolyakov

An alternative to HashingTF is a CountVectorizer, which produces a vocabulary:

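from pyspark.ml.feature import CountVectorizer

# Fit a vocabulary on the filtered tokens (num_features caps its size)
count_vec = CountVectorizer(inputCol="tokens_filtered", outputCol="tf_features", vocabSize=num_features, minDF=2.0)
count_vec_model = count_vec.fit(newsgroups)
newsgroups = count_vec_model.transform(newsgroups)
# vocabulary is a list of words; position i holds the word for term index i
vocab = count_vec_model.vocabulary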
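
To make it clearer why a fitted vocabulary gives an index-to-word mapping, here is a rough pure-Python sketch of the idea. It is not Spark's implementation: `build_vocab` is a hypothetical helper, the tiny corpus is made up, and the alphabetical tie-breaking is an assumption added only to keep the result deterministic.

```python
from collections import Counter

def build_vocab(docs, vocab_size, min_df=1):
    """Toy stand-in for a fitted CountVectorizer vocabulary (not Spark's code)."""
    term_freq = Counter()  # total occurrences of each term across the corpus
    doc_freq = Counter()   # number of documents that contain each term
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    # Drop terms that appear in fewer than min_df documents
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    # Most frequent terms first; ties broken alphabetically (an assumption)
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:vocab_size]

# Made-up corpus of already tokenized documents
docs = [["spark", "lda", "topic"], ["spark", "topic", "spark"], ["rare", "spark"]]
vocab = build_vocab(docs, vocab_size=10, min_df=2)
```

Because the vocabulary is an ordinary list, `vocab[i]` recovers the word for term index `i`. A hash-based index keeps no such list, which is why the words cannot be read back from it directly.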

Given the vocabulary as a list of words, we can index into it to visualize the topics:

topics = model.describeTopics()
topics_rdd = topics.rdd

# Look up each topic's term indices in the vocabulary
topics_words = (topics_rdd
    .map(lambda row: row['termIndices'])
    .map(lambda idx_list: [vocab[idx] for idx in idx_list])
    .collect())

for idx, topic in enumerate(topics_words):
    print("topic: {}".format(idx))
    print("----------")
    for word in topic:
        print(word)
    print("----------")
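
If the weights are wanted next to the words, the same lookup can be paired with `termWeights`. The following is only a sketch in plain Python: `top_terms` is a hypothetical helper, and `rows` and `vocab` are made-up stand-ins for the result of `topics.collect()` and the fitted vocabulary.

```python
def top_terms(term_indices, term_weights, vocab):
    """Pair each term index with its word and weight (hypothetical helper)."""
    return [(vocab[i], w) for i, w in zip(term_indices, term_weights)]

# Made-up stand-ins for the fitted vocabulary and for topics.collect()
vocab = ["space", "nasa", "god", "car", "engine"]
rows = [
    {"topic": 0, "termIndices": [1, 0], "termWeights": [0.12, 0.08]},
    {"topic": 1, "termIndices": [3, 4], "termWeights": [0.20, 0.05]},
]

for row in rows:
    pairs = top_terms(row["termIndices"], row["termWeights"], vocab)
    line = ", ".join("{} ({:.2f})".format(word, weight) for word, weight in pairs)
    print("topic {}: {}".format(row["topic"], line))
```

With real data, each collected row should allow the same field access by name, so the loop body would stay the same.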

Source

2017-05-29 03:53:04
