2017-10-20

Getting the instance and topic sequence of all documents in MALLET

I am using the MALLET library for topic modeling. My dataset is at the path filePath, and the CsvIterator seems to read the data, because model.getData() has about 27,000 rows, which equals the size of my dataset. I wrote a loop that prints the instance and topic sequence of the first 10 documents, but the size of the tokens is 0. Where did I go wrong?

Below, I want to display the top words of each topic with their proportions for the first 10 documents, but all the console output is identical.

For example:

---- document 0

0   0.200   com (1723) twitter (1225) http (871) cbr (688) canberra (626)

1   0.200   com (981) twitter (901) day (205) can (159) wed (156)

2   0.200   twitter (1068) com (947) act (433) actvcc (317) canberra (302)

3   0.200   http (1039) canberra (841) jobs (378) dlvr (313) com (228)

4   0.200   com (1185) www (1074) http (831) news (708) canberratimes (560)

---- document 1

0   0.200   com (1723) twitter (1225) http (871) cbr (688) canberra (626)

1   0.200   com (981) twitter (901) day (205) can (159) wed (156)

2   0.200   twitter (1068) com (947) act (433) actvcc (317) canberra (302)

3   0.200   http (1039) canberra (841) jobs (378) dlvr (313) com (228)

4   0.200   com (1185) www (1074) http (831) news (708) canberratimes (560)

As far as I understand, the LDA model generates the words of each document and assigns them to topics. So why are the results the same for every document?

    import cc.mallet.pipe.*;
    import cc.mallet.pipe.iterator.CsvIterator;
    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.topics.TopicAssignment;
    import cc.mallet.types.*;

    import java.io.*;
    import java.util.*;
    import java.util.regex.Pattern;

    // Import pipeline: lowercase, tokenize, remove stopwords, map tokens to feature indices
    ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
    pipeList.add(new CharSequenceLowercase());
    pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
    // stoplists/en.txt
    pipeList.add(new TokenSequenceRemoveStopwords(new File(pathStopWords), "UTF-8", false, false, false));
    pipeList.add(new TokenSequence2FeatureSequence());

    InstanceList instances = new InstanceList(new SerialPipes(pipeList));

    Reader fileReader = new InputStreamReader(new FileInputStream(new File(filePath)), "UTF-8");
    // Header of my data set:
    // row,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment
    CsvIterator csvIterator = new CsvIterator(fileReader,
        Pattern.compile("^(\\d+)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*([^,]*)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*[^,]*$"),
        2, 0, 1);
    instances.addThruPipe(csvIterator); // data, label, name fields

    int numTopics = 5;
    ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);

    model.addInstances(instances);
    model.setNumThreads(2);
    model.setNumIterations(50);
    model.estimate();

    Alphabet dataAlphabet = instances.getDataAlphabet();
    ArrayList<TopicAssignment> arrayTopics = model.getData();

    for (int i = 0; i < 10; i++) {
        System.out.println("---- document " + i);
        // Token and topic sequences for document i
        FeatureSequence tokens = (FeatureSequence) model.getData().get(i).instance.getData();
        LabelSequence topics = model.getData().get(i).topicSequence;

        Formatter out = new Formatter(new StringBuilder(), Locale.US);
        for (int position = 0; position < tokens.getLength(); position++) {
            out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)),
                topics.getIndexAtPosition(position));
        }
        System.out.println(out);

        // Topic proportions for this document
        double[] topicDistribution = model.getTopicProbabilities(i);

        // Top words per topic, sorted by weight
        ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords();

        for (int topic = 0; topic < numTopics; topic++) {
            Iterator<IDSorter> iterator = topicSortedWords.get(topic).iterator();
            out = new Formatter(new StringBuilder(), Locale.US);
            out.format("%d\t%.3f\t", topic, topicDistribution[topic]);
            int rank = 0;
            while (iterator.hasNext() && rank < 5) {
                IDSorter idCountPair = iterator.next();
                out.format("%s (%.0f) ", dataAlphabet.lookupObject(idCountPair.getID()), idCountPair.getWeight());
                rank++;
            }
            System.out.println(out);
        }

        StringBuilder topicZeroText = new StringBuilder();
        Iterator<IDSorter> iterator = topicSortedWords.get(0).iterator();

        int rank = 0;
        while (iterator.hasNext() && rank < 5) {
            IDSorter idCountPair = iterator.next();
            topicZeroText.append(dataAlphabet.lookupObject(idCountPair.getID()) + " ");
            rank++;
        }
    }
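The CsvIterator line regex above can be checked in isolation before handing it to MALLET. A minimal stand-alone sketch with java.util.regex (the sample row is made up, following the stated header):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvRegexCheck {
    public static void main(String[] args) {
        // Same pattern as passed to CsvIterator: group 1 = row id (used as name),
        // group 2 = the 5th column (text), per the header
        // row,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment
        Pattern p = Pattern.compile(
            "^(\\d+)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*([^,]*)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*[^,]*$");

        // Hypothetical data row with no commas inside the text field
        String row = "42,Canberra,someuser,#cbr,Traffic update http://dlvr.it/abc,3,2017-10-20,1,0";
        Matcher m = p.matcher(row);
        if (m.matches()) {
            System.out.println("name = " + m.group(1)); // 42
            System.out.println("data = " + m.group(2)); // Traffic update http://dlvr.it/abc
        } else {
            System.out.println("no match");
        }
    }
}
```

Note the pattern only works if the text column itself contains no commas; a row whose tweet text includes a comma would shift the captured groups.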

Answer

The topics are defined at the model level, not at the document level, so they should be identical from document to document.
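To restate the distinction in the asker's loop: getSortedWords() is derived from corpus-wide topic-word counts, so printing it inside a per-document loop repeats the same lines; only the per-document proportions (getTopicProbabilities(i)) vary. A stand-alone sketch of that idea, using a hypothetical toy count table instead of MALLET:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ModelVsDocument {
    public static void main(String[] args) {
        // Hypothetical corpus-wide word counts for one topic (model level):
        // the same table exists regardless of which document we are printing
        Map<String, Integer> topic0 = new HashMap<>();
        topic0.put("com", 1723);
        topic0.put("twitter", 1225);
        topic0.put("http", 871);

        // Sorting top words uses only the model-level table...
        List<Map.Entry<String, Integer>> top = new ArrayList<>(topic0.entrySet());
        top.sort((a, b) -> b.getValue() - a.getValue());

        // ...so a per-document loop prints identical top-word lines
        for (int doc = 0; doc < 2; doc++) {
            System.out.println("---- document " + doc + ": " + top.get(0).getKey());
        }
        // Only per-document quantities (topic proportions, token assignments) differ
    }
}
```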

It looks like all of the text consists of URLs. Adding a PrintInputPipe to the import sequence might help with debugging.
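To see why URL text dominates the vocabulary, the token pattern from CharSequence2TokenSequence can be run standalone. Since '.', ':', and '/' are all in \p{P}, contiguous URL runs survive as single tokens, breaking only at digits (\p{N} is not in the character class) and whitespace. A sketch with a made-up tweet string:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenPatternCheck {
    public static void main(String[] args) {
        // Same token pattern as in the question's pipeline
        Pattern token = Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}");
        Matcher m = token.matcher("http://dlvr.it/3abc via twitter.com");
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group());
        }
        // The URL splits at the digit '3', not at the punctuation
        System.out.println(tokens); // [http://dlvr.it, abc, via, twitter.com]
    }
}
```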