
I am trying to find the optimal number of topics for an LDA model with sklearn. To do this, I compute perplexity, following the code at https://gist.github.com/tmylk/b71bf7d3ec2f203bfce2. How should the sklearn LDA perplexity score be interpreted? Why does it keep increasing as the number of topics increases?

But whenever I increase the number of topics, the perplexity always increases irrationally. Is there a bug in my implementation, or are these actually the correct values?
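For reference, my understanding is that scikit-learn reports perplexity as the exponential of the negative (variational) log-likelihood per word, so lower values are better. A minimal sketch of that relationship (bound and n_words here are stand-ins for illustration, not scikit-learn API):

import numpy as np

def perplexity_from_bound(bound, n_words):
    # perplexity = exp(-(log-likelihood bound) / (total word count));
    # a lower value means the model assigns higher probability to the corpus
    return np.exp(-bound / n_words)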

from __future__ import print_function 
from time import time 

from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.decomposition import LatentDirichletAllocation 

n_samples = 0.7   # fraction of documents used for training (printed with %d below, so it shows as 0)
n_features = 1000 
n_top_words = 20 

# kickstarter is a pandas DataFrame loaded elsewhere
dataset = kickstarter['short_desc'].tolist() 
data_samples = dataset[:int(len(dataset) * n_samples)]   # first 70% for training
test_samples = dataset[int(len(dataset) * n_samples):]   # remaining 30% held out

Use tf (raw term count) features for LDA:

print("Extracting tf features for LDA...") 
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, 
           max_features=n_features, 
           stop_words='english') 
t0 = time() 
tf = tf_vectorizer.fit_transform(data_samples) 
print("done in %0.3fs." % (time() - t0)) 
# Use tf (raw term count) features for LDA. 
print("Extracting tf features for LDA...") 
t0 = time() 
tf_test = tf_vectorizer.transform(test_samples) 
print("done in %0.3fs." % (time() - t0)) 

Compute perplexity for 5, 10, 15, ..., 100 topics:

for i in xrange(5, 101, 5):   # use range() on Python 3
    n_topics = i 

    print("Fitting LDA models with tf features, " 
          "n_samples=%d, n_features=%d n_topics=%d " 
          % (n_samples, n_features, n_topics)) 

    # n_topics was renamed to n_components in scikit-learn 0.19+
    lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5, 
                                    learning_method='online', 
                                    learning_offset=50., 
                                    random_state=0) 
    t0 = time() 
    lda.fit(tf) 

    # perplexity on the training documents
    train_gamma = lda.transform(tf) 
    train_perplexity = lda.perplexity(tf, train_gamma) 

    # perplexity on the held-out documents
    test_gamma = lda.transform(tf_test) 
    test_perplexity = lda.perplexity(tf_test, test_gamma) 

    print('sklearn perplexity: train=%.3f, test=%.3f' % 
          (train_perplexity, test_perplexity)) 

    print("done in %0.3fs." % (time() - t0)) 

Results

Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=5 
sklearn perplexity: train=9500.437, test=12350.525 
done in 4.966s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=10 
sklearn perplexity: train=341234.228, test=492591.925 
done in 4.628s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=15 
sklearn perplexity: train=11652001.711, test=17886791.159 
done in 4.337s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=20 
sklearn perplexity: train=402465954.270, test=609914097.869 
done in 4.351s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=25 
sklearn perplexity: train=14132355039.630, test=21945586497.205 
done in 4.438s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=30 
sklearn perplexity: train=499209051036.715, test=770208066318.557 
done in 4.076s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=35 
sklearn perplexity: train=16539345584599.268, test=24731601176317.836 
done in 4.230s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=40 
sklearn perplexity: train=586526357904887.250, test=880809950700756.625 
done in 4.596s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=45 
sklearn perplexity: train=20928740385934636.000, test=31065168894315760.000 
done in 4.563s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=50 
sklearn perplexity: train=734804198843926784.000, test=1102284263786783616.000 
done in 4.790s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=55 
sklearn perplexity: train=24747026375445286912.000, test=36634830286916853760.000 
done in 4.839s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=60 
sklearn perplexity: train=879215493067590729728.000, test=1268331920975308783616.000 
done in 4.827s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=65 
sklearn perplexity: train=30267393208097070645248.000, test=43678395923698735382528.000 
done in 4.705s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=70 
sklearn perplexity: train=1091388615092136975532032.000, test=1564111432914603675222016.000 
done in 4.626s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=75 
sklearn perplexity: train=37463573890268863118966784.000, test=51513357456275195169865728.000 
done in 5.034s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=80 
sklearn perplexity: train=1281758440147129243608809472.000, test=1736796133443165299937378304.000 
done in 5.348s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=85 
sklearn perplexity: train=45100838968058242714191265792.000, test=62725627465378386290422054912.000 
done in 4.987s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=90 
sklearn perplexity: train=1555576278144903954081448460288.000, test=2117105172204280105824751190016.000 
done in 5.032s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=95 
sklearn perplexity: train=52806759455785055803020813533184.000, test=70510180325555822379548402515968.000 
done in 5.284s. 
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=100 
sklearn perplexity: train=1885916623308147578324101753733120.000, test=2505878598724106449894719231098880.000 
done in 5.374s. 
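Ultimately, to pick the number of topics I would plot perplexity against n_topics and look for a minimum or an elbow. A sketch of that, reusing tf and tf_test from above and letting perplexity() infer the doc-topic distribution (matplotlib assumed available):

import matplotlib.pyplot as plt
from sklearn.decomposition import LatentDirichletAllocation

topic_counts = list(range(5, 101, 5))
train_scores, test_scores = [], []
for n in topic_counts:
    lda = LatentDirichletAllocation(n_topics=n, max_iter=5,
                                    learning_method='online',
                                    learning_offset=50., random_state=0)
    lda.fit(tf)
    # perplexity() without doc_topic_distr infers the distribution itself
    train_scores.append(lda.perplexity(tf))
    test_scores.append(lda.perplexity(tf_test))

plt.plot(topic_counts, train_scores, marker='o', label='train')
plt.plot(topic_counts, test_scores, marker='o', label='test')
plt.xlabel('n_topics')
plt.ylabel('perplexity')
plt.legend()
plt.show()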

May I ask why you rolled back the peer-approved edit? I think this question is interesting, but in its current state it is very hard to interpret; the poor grammar makes it essentially unreadable. – elphz


Apart from the grammar issues, the corrected sentence meant something different from what I intended. For example, I think that increasing the number of topics should, in general, decrease perplexity. Even though my current results are not right, the point is not simply that the value increases or decreases. – JonghoKim


Fair enough, though I still think that is basically what the edit reflected, except that it emphasized monotonicity (always increasing or always decreasing) rather than simply decreasing. Your current problem statement is confusing because your results do not always increase with the number of topics; sometimes they increase and sometimes they decrease (I believe that is what you mean by "irrationally" here - it may have been lost in translation - "irrational" means something different in mathematics and does not make sense in this context, so I would suggest changing it). – elphz

Answers
