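Word distribution p(w | t) for each topic in Mallet

2016-05-27

I need to get the word distribution for each topic found by Mallet, programmatically in Java (not via the CLI, as described in "how to get a probability distribution for a topic in mallet?"). As an example of what I mean, from "Introduction to Latent Dirichlet Allocation":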

Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food) 
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals) 

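Mallet provides per-topic token "weights", and at http://comments.gmane.org/gmane.comp.ai.mallet.devel/2064 someone tried to write a method that returns the word distribution for each topic in Mallet.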

I modified that method so that all weights are divided by their sum, as discussed on the mailing list thread above.

Does the following method (when added to ParallelTopicModel.java) correctly compute the distribution of words for each topic, p(w | t), in Mallet?

/**
 * Get the normalized topic word weights (weights sum up to 1.0)
 * @param topic the topic
 * @return the normalized topic word weights (weights sum up to 1.0)
 */
public ArrayList<double[]> getNormalizedTopicWordWeights(int topic) {
    ArrayList<double[]> tokenWeights = new ArrayList<double[]>();
    for (int type = 0; type < numTypes; type++) {
        int[] topicCounts = typeTopicCounts[type];
        double weight = beta;
        int index = 0;
        while (index < topicCounts.length && topicCounts[index] > 0) {
            int currentTopic = topicCounts[index] & topicMask;
            if (currentTopic == topic) {
                weight += topicCounts[index] >> topicBits;
                break;
            }
            index++;
        }
        double[] tokenAndWeight = { (double) type, weight };
        tokenWeights.add(tokenAndWeight);
    }
    // normalize
    double sum = 0;
    // get the sum
    for (double[] tokenAndWeight : tokenWeights) {
        sum += tokenAndWeight[1];
    }
    // divide each element by the sum
    ArrayList<double[]> normalizedTokenWeights = new ArrayList<double[]>();
    for (double[] tokenAndWeight : tokenWeights) {
        tokenAndWeight[1] = tokenAndWeight[1] / sum;
        normalizedTokenWeights.add(tokenAndWeight);
    }
    return normalizedTokenWeights;
}

Answer

This looks like it will work, but I have a few comments on style.

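I'm not keen on using double arrays to represent topic/weight pairs. If you are iterating over all types anyway, why not use a dense double[] array indexed by type? If you need to sort the entries in some other method outside this one, an ArrayList might make sense, but the unnormalized intermediate ArrayList seems wasteful.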

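The second summing loop seems unnecessary. You could initialize sum to numTypes * beta up front, and then add weight - beta only when you encounter a type with a non-zero count.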
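To see why that initialization gives the same sum, here is the arithmetic written out. The notation is an assumption introduced for illustration: $V$ stands for numTypes and $n_{w,t}$ for the count of type $w$ in topic $t$.

```latex
\sum_{w=1}^{V} \left( n_{w,t} + \beta \right)
  = V\beta + \sum_{w \,:\, n_{w,t} > 0} n_{w,t},
\qquad
p(w \mid t) = \frac{n_{w,t} + \beta}{V\beta + \sum_{w'} n_{w',t}}
```

Types with a zero count contribute only beta each, so they are already covered by the numTypes * beta term.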

If you define normalizer = 1.0/sum and then multiply rather than divide in the normalizing loop, it often makes a noticeable difference in performance.
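Putting these three suggestions together might look roughly like the sketch below. It is a standalone illustration, not part of Mallet: the class TopicWordDistribution, the method normalizedTopicWordWeights, and the packedCounts parameter are hypothetical names, and the packed layout (count shifted left by topicBits, topic in the low bits, unused slots equal to 0) is assumed from the code in the question.

```java
import java.util.Arrays;

// Hypothetical standalone class; inside ParallelTopicModel the fields
// typeTopicCounts, topicMask, topicBits and beta would be used directly.
class TopicWordDistribution {

    /**
     * Returns p(w | t) as a dense array indexed by type.
     * Assumes each entry of packedCounts[type] is (count << topicBits) | topic,
     * with unused trailing slots equal to 0, as in the question's code.
     */
    static double[] normalizedTopicWordWeights(int[][] packedCounts, int topic,
                                               int topicMask, int topicBits,
                                               double beta) {
        int numTypes = packedCounts.length;
        double[] weights = new double[numTypes];
        Arrays.fill(weights, beta);

        // Every type contributes at least beta, so start the sum there.
        double sum = numTypes * beta;

        for (int type = 0; type < numTypes; type++) {
            int[] topicCounts = packedCounts[type];
            for (int index = 0;
                 index < topicCounts.length && topicCounts[index] > 0;
                 index++) {
                if ((topicCounts[index] & topicMask) == topic) {
                    int count = topicCounts[index] >> topicBits;
                    weights[type] += count;
                    sum += count; // add only the non-zero count
                    break;
                }
            }
        }

        // Multiply by the reciprocal instead of dividing in the loop.
        double normalizer = 1.0 / sum;
        for (int type = 0; type < numTypes; type++) {
            weights[type] *= normalizer;
        }
        return weights;
    }

    public static void main(String[] args) {
        // Toy data: up to 4 topics, so topicBits = 2 and topicMask = 3.
        int topicBits = 2;
        int topicMask = 3;
        int[][] packed = {
            { (3 << topicBits) | 1, (1 << topicBits) | 0 }, // type 0: 3 in topic 1, 1 in topic 0
            { (2 << topicBits) | 0, 0 },                    // type 1: 2 in topic 0
            { 0, 0 }                                        // type 2: no counts
        };
        double[] p = normalizedTopicWordWeights(packed, 1, topicMask, topicBits, 0.5);
        System.out.println(Arrays.toString(p));
    }
}
```

With the toy counts in main and beta = 0.5, the unnormalized weights for topic 1 are 3.5, 0.5 and 0.5, so the returned values are 3.5/4.5, 0.5/4.5 and 0.5/4.5. If the caller needs the words sorted by probability, sorting can be done afterwards on type indices rather than by building an intermediate list of pairs.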