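Word distribution p(w | t) for each topic in Mallet

I need to obtain the word distribution of each topic found by Mallet programmatically in Java (not via the CLI, as described in "how to get a probability distribution for a topic in mallet?"). As an example of what I mean, from "Introduction to Latent Dirichlet Allocation":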

Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food) 
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)

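Mallet provides token "weights" for each topic, and at http://comments.gmane.org/gmane.comp.ai.mallet.devel/2064 someone tried to write a method that returns the word distribution of each topic in Mallet.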

I modified that method so that every weight is divided by the sum of all weights, as discussed on the mailing list linked above.

Does the following method (when added to ParallelTopicModel.java) correctly compute the per-topic word distribution p(w | t) in Mallet?

/** 
* Get the normalized topic word weights (weights sum up to 1.0) 
* @param topic the topic 
* @return the normalized topic word weights (weights sum up to 1.0) 
*/ 
public ArrayList<double[]> getNormalizedTopicWordWeights(int topic) { 
    ArrayList<double[]> tokenWeights = new ArrayList<double[]>(); 
    for (int type = 0; type < numTypes; type++) { 
     int[] topicCounts = typeTopicCounts[type]; 
     double weight = beta; 
     int index = 0; 
     while (index < topicCounts.length && topicCounts[index] > 0) { 
      int currentTopic = topicCounts[index] & topicMask; 
      if (currentTopic == topic) { 
       weight += topicCounts[index] >> topicBits; 
       break; 
      } 
      index++; 
     } 
     double[] tokenAndWeight = { (double) type, weight }; 
     tokenWeights.add(tokenAndWeight); 
    } 
    // normalize 
    double sum = 0; 
    // get the sum 
    for (double[] tokenAndWeight : tokenWeights) { 
     sum += tokenAndWeight[1]; 
    } 
    // divide each element by the sum 
    ArrayList<double[]> normalizedTokenWeights = new ArrayList<double[]>(); 
    for (double[] tokenAndWeight : tokenWeights) { 
     tokenAndWeight[1] = tokenAndWeight[1]/sum; 
     normalizedTokenWeights.add(tokenAndWeight); 
    } 
    return normalizedTokenWeights; 
}
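
Written out, and assuming the unnormalized weight of word type w in topic t is its count plus the smoothing parameter beta (as in the code above), the quantity this method is meant to return is the following. The symbols n_{w,t} (count of type w assigned to topic t) and V (numTypes) are notation introduced here for illustration.

```latex
p(w \mid t)
  = \frac{n_{w,t} + \beta}{\sum_{w'=1}^{V} \left( n_{w',t} + \beta \right)}
  = \frac{n_{w,t} + \beta}{V\beta + \sum_{w'=1}^{V} n_{w',t}}
```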

Source

2016-05-27 tkja

This looks like it will work, but I have a few comments on style.

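I'm not keen on using double arrays to represent token/weight pairs. Since you iterate over all types anyway, why not use a dense double[] array indexed by type? An ArrayList might make sense if you need to sort the entries in another method outside this one, but the intermediate ArrayList of unnormalized entries seems wasteful.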

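The second loop, which computes the sum, looks unnecessary. You can initialize sum to numTypes * beta and then add weight - beta only when you encounter a type with a non-zero count for the topic.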

If you define normalizer = 1.0/sum and then multiply rather than divide in the normalization loop, it often makes a noticeable difference.
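
A minimal sketch of how these three suggestions might fit together, written as a standalone static method so that it runs without Mallet. The parameters typeTopicCounts, topicMask, topicBits and beta stand in for the ParallelTopicModel fields used in the question, and the packed-count layout (count in the high bits, topic in the low bits, unused slots zero) is assumed from the question's code rather than checked against Mallet's source. The class name and the toy data are hypothetical.

```java
import java.util.Arrays;

class NormalizedTopicWords {

    /**
     * Returns p(w | t) as a dense array indexed by type.
     * Assumes each used entry of typeTopicCounts[type] packs
     * (count << topicBits) | topic, with unused entries equal to 0,
     * as in the method from the question.
     */
    static double[] normalizedTopicWordWeights(int[][] typeTopicCounts, int topic,
                                               int topicMask, int topicBits, double beta) {
        int numTypes = typeTopicCounts.length;
        double[] weights = new double[numTypes];
        // Every type contributes at least beta, so start the sum there.
        double sum = numTypes * beta;

        for (int type = 0; type < numTypes; type++) {
            weights[type] = beta;
            int[] topicCounts = typeTopicCounts[type];
            for (int index = 0; index < topicCounts.length && topicCounts[index] > 0; index++) {
                if ((topicCounts[index] & topicMask) == topic) {
                    int count = topicCounts[index] >> topicBits;
                    weights[type] += count;
                    sum += count; // same as adding (weight - beta)
                    break;
                }
            }
        }

        // Multiply by the reciprocal instead of dividing in the loop.
        double normalizer = 1.0 / sum;
        for (int type = 0; type < numTypes; type++) {
            weights[type] *= normalizer;
        }
        return weights;
    }

    public static void main(String[] args) {
        // Toy data with 2 topics, so topicBits = 1 and topicMask = 1 (made-up values).
        int topicBits = 1, topicMask = 1;
        int[][] typeTopicCounts = {
            { (3 << topicBits) | 0, (1 << topicBits) | 1 }, // type 0: 3 in topic 0, 1 in topic 1
            { (2 << topicBits) | 1, 0 },                    // type 1: 2 in topic 1
            { (1 << topicBits) | 0, 0 }                     // type 2: 1 in topic 0
        };
        double[] p = normalizedTopicWordWeights(typeTopicCounts, 0, topicMask, topicBits, 0.5);
        System.out.println(Arrays.toString(p)); // entries sum to 1 up to rounding
    }
}
```

If a sorted list of (type, probability) pairs is needed later, it can be built from the returned array at that point, which avoids allocating the intermediate list of unnormalized pairs.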

Source

2016-06-10 15:49:06
