How to print an LDA topic model and a word cloud for each topic

from nltk.tokenize import RegexpTokenizer 
from stop_words import get_stop_words 
from gensim import corpora, models 
import gensim 
import os 
from os import path 
from time import sleep 
import matplotlib.pyplot as plt 
import random 
from wordcloud import WordCloud, STOPWORDS 
tokenizer = RegexpTokenizer(r'\w+') 
en_stop = set(get_stop_words('en')) 
with open(os.path.join('c:\users\kaila\jobdescription.txt')) as f: 
    Reader = f.read() 

Reader = Reader.replace("will", " ") 
Reader = Reader.replace("please", " ") 


texts = unicode(Reader, errors='replace') 
tdm = [] 

raw = texts.lower() 
tokens = tokenizer.tokenize(raw) 
stopped_tokens = [i for i in tokens if not i in en_stop] 
tdm.append(stopped_tokens) 

dictionary = corpora.Dictionary(tdm) 
corpus = [dictionary.doc2bow(i) for i in tdm] 
sleep(3) 
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=8, id2word = dictionary) 
topics = ldamodel.print_topics(num_topics=8, num_words=200) 
for i in topics: 
    print(i) 
    wordcloud = WordCloud().generate(i) 
    plt.imshow(wordcloud) 
    plt.axis("off") 
    plt.show()

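The problem is with the word cloud. I cannot get a word cloud for each of the 8 topics. I want output that gives 8 word clouds, one per topic. It would be great if someone could help me with this.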

2016-10-27 Raj

Assuming you have already trained a gensim LDA model, you can create the word clouds with the following code

# lda is assumed to be the variable holding the LdaModel object
import matplotlib.pyplot as plt
from wordcloud import WordCloud

for t in range(lda.num_topics):
    plt.figure()
    # show_topic gives (word, probability) pairs; recent wordcloud versions expect a dict
    plt.imshow(WordCloud().fit_words(dict(lda.show_topic(t, 200))))
    plt.axis("off")
    plt.title("Topic #" + str(t))
    plt.show()

I will point out a few errors in your code so that you can better follow what I have written above.

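WordCloud().generate(something) expects something to be raw text. It tokenizes it, lowercases it and removes stop words, and then computes the word cloud from word counts. What you need is for the word sizes to match the words' probabilities within a topic (I think).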
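To see why this matters, here is a rough stand-in for the counting step using only the standard library. It is not wordcloud's actual implementation, and the topic string is made up for illustration. Counting the words in a printed topic gives every word the same weight, so the probabilities are lost.

```python
import re
from collections import Counter

# Made-up example of the text printed for one topic.
topic_text = '0.030*"data" + 0.020*"team" + 0.010*"sales"'

def count_words(text):
    """Count alphabetic tokens, roughly as a frequency-based word cloud would."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

counts = count_words(topic_text)
print(counts)  # every word occurs exactly once, so all would be drawn the same size
```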

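lda.print_topics(8, 200) returns a text representation of each topic in the form prob1*"token1" + prob2*"token2" + ...; you need lda.show_topic(topic, num_words) to get the words with their corresponding probabilities as tuples. Then you need WordCloud().fit_words() to generate the word cloud from them.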
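If all you have is the printed form, the weights can still be recovered by parsing it. This is only a sketch that assumes the prob*"token" format described above; parse_topic_string is a hypothetical helper, not part of gensim, and the sample string is made up. Calling show_topic directly remains the simpler route.

```python
import re

def parse_topic_string(topic_text):
    """Turn 'p1*"w1" + p2*"w2"' into {w1: p1, w2: p2}."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_text)
    return {word: float(prob) for prob, word in pairs}

weights = parse_topic_string('0.030*"data" + 0.020*"team" + 0.010*"sales"')
print(weights)
# The result has the same shape as dict(lda.show_topic(t, n)):
# a mapping from word to weight.
```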

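The following is your code with the above visualization. I would also like to point out that you are inferring topics from a single document, which is very unusual and probably not what you want.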

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from gensim import corpora
import gensim
import matplotlib.pyplot as plt
from wordcloud import WordCloud

tokenizer = RegexpTokenizer(r'\w+')
en_stop = set(get_stop_words('en'))

# Raw string so the backslashes in the Windows path are not treated as escapes;
# errors='replace' substitutes any bytes that cannot be decoded
with open(r'c:\users\kaila\jobdescription.txt', errors='replace') as f:
    Reader = f.read()

Reader = Reader.replace("will", " ")
Reader = Reader.replace("please", " ")

tdm = []

raw = Reader.lower()
tokens = tokenizer.tokenize(raw)
stopped_tokens = [i for i in tokens if i not in en_stop]
tdm.append(stopped_tokens)

dictionary = corpora.Dictionary(tdm)
corpus = [dictionary.doc2bow(i) for i in tdm]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=8, id2word=dictionary)
for t in range(ldamodel.num_topics):
    plt.figure()
    # show_topic gives (word, probability) pairs; recent wordcloud versions expect a dict
    plt.imshow(WordCloud().fit_words(dict(ldamodel.show_topic(t, 200))))
    plt.axis("off")
    plt.title("Topic #" + str(t))
    plt.show()
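
On the single-document point: LDA is normally trained on many documents, for example one per job posting. Below is a minimal sketch that assumes the postings in the file are separated by blank lines (an assumption about the file layout, not something stated in the question) and uses a tiny hypothetical stop-word set in place of get_stop_words('en').

```python
import re

# Hypothetical stop words; the real code would use set(get_stop_words('en')).
STOP = {"the", "a", "and", "for", "to", "will", "please"}

def split_into_documents(text):
    """Return one token list per blank-line-separated block of text."""
    blocks = re.split(r"\n\s*\n", text.strip())
    docs = []
    for block in blocks:
        tokens = re.findall(r"\w+", block.lower())
        docs.append([t for t in tokens if t not in STOP])
    return docs

sample = "Please apply for the analyst role.\n\nWe will hire a sales manager."
tdm = split_into_documents(sample)
print(tdm)  # two token lists, one per posting
```

The resulting tdm can be passed to corpora.Dictionary and doc2bow exactly as in the code above. Treating "will" and "please" as stop words also avoids str.replace cutting them out of longer words such as "willing".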

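Although it comes from a different library, you can see what the result would look like in topic visualizations with corresponding code (disclaimer: I am one of the authors of that library).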

2016-10-27 10:57:52 katharas

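Thank you very much. This certainly solved my problem. I am sorry that I cannot upvote right now because I do not have enough reputation. – Raj

I actually scraped jobsdb data and used it for the analysis. The scraped data was compiled into a single file for topic modeling. – Raj

Thanks for your answer. However, in the latest version of wordcloud, 'fit_words' expects a dictionary, while 'lda.show_topic' returns a list of tuples. I had to use 'plt.imshow(WordCloud().fit_words(dict(lda.show_topic(t, 200))))' to make it work. –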

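The following worked for me. First, create an LDA model and define the clusters/topics as discussed in Topic Clustering, making sure minimum_probability is 0. Next, compute the LDA corpus with lda_corpus = lda[corpus]. Then collect, as a list, the documents from the LDA corpus that belong to each topic; the example below has two topics. df is the raw data with a column named texts.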

cluster1 = [j for i,j in zip(lda_corpus,df.texts) if i[0][1] > .2] 
cluster2 = [j for i,j in zip(lda_corpus,df.texts) if i[1][1] > .2]
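
The same idea extends to any number of topics. The sketch below uses hand-written lists in place of lda[corpus] and df.texts so that it runs without a trained model; group_by_topic and its threshold default are illustrative names and values, not part of gensim. It matches on the topic id rather than on list position, so it also works when low-probability topics are left out of a document's list.

```python
from collections import defaultdict

def group_by_topic(lda_corpus, texts, threshold=0.2):
    """Map topic id -> texts whose probability for that topic exceeds threshold."""
    clusters = defaultdict(list)
    for doc_topics, text in zip(lda_corpus, texts):
        for topic_id, prob in doc_topics:
            if prob > threshold:
                clusters[topic_id].append(text)
    return dict(clusters)

# Stand-in data shaped like lda[corpus]: one list of (topic_id, probability) per document.
fake_lda_corpus = [[(0, 0.9), (1, 0.1)], [(0, 0.3), (1, 0.7)], [(0, 0.05), (1, 0.95)]]
fake_texts = ["doc a", "doc b", "doc c"]

clusters = group_by_topic(fake_lda_corpus, fake_texts)
print(clusters)
```

Here clusters[0] plays the role of cluster1 and clusters[1] the role of cluster2.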

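Get the word cloud for each cluster. We can pass in as many stop words as we like. Make sure to clean the data in each cluster first, for example by removing stop words and stemming. I am skipping those steps here and assuming each cluster already contains cleaned text/documents.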

wordcloud = WordCloud(relative_scaling=1.0, stopwords={"xxx", "yyy"}).generate(' '.join(cluster1))

Finally, plot the word cloud using matplotlib

plt.imshow(wordcloud)
plt.show()

2017-04-19 01:38:21
