Folding in (estimating topics for new documents) in LDA using Mallet in Java

I'm using Mallet through Java, and I can't work out how to evaluate new documents against an existing topic model that I have already trained.

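My initial code to generate the model is very similar to that in the Mallet Developers Guide for Topic Modelling, after which I simply save the model as a Java object. In a later process, I reload the Java object from file, add new instances via .addInstances(), and would then like to evaluate only these new instances against the topics found in the original training set.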

This stats.SE thread provides some high-level suggestions, but I can't see how to work them into the Mallet framework.

Any help is much appreciated.


2013-01-03 Ina

And I found the answer hidden in a slide-deck from Mallet's lead developer:

TopicInferencer inferencer = model.getInferencer();
// arguments: instance, numIterations, thinning, burnIn
double[] topicProbs = inferencer.getSampledDistribution(newInstance, 100, 10, 10);


2013-01-03 15:15:40 Ina

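This is the way to go. Also, if you want to save the model after training and load it before testing (keeping the two steps separate), you can look at this answer: https://stackoverflow.com/a/44379106/1042409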

Inference is actually also listed in the example link provided in the question (in the last few lines).

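For anyone interested in the whole code for saving/loading the trained model and then using it to infer the topic distribution of new documents, here are some snippets:

After model.estimate() has completed, you have the actual trained model, so you can serialize it with a standard Java ObjectOutputStream (since ParallelTopicModel implements Serializable):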

try { 
    FileOutputStream outFile = new FileOutputStream("model.ser"); 
    ObjectOutputStream oos = new ObjectOutputStream(outFile); 
    oos.writeObject(model); 
    oos.close(); 
} catch (FileNotFoundException ex) { 
    // handle this error 
} catch (IOException ex) { 
    // handle this error 
}

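Note, though, that when you infer, you also need to pass the new sentence (as an Instance) through the same pipeline in order to pre-process it (tokenize, etc.). So you also need to save the pipe list (since we are using SerialPipes, we can create an instance of it and then serialize it):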

// initialize the pipe list (the same one used in model training)
SerialPipes pipes = new SerialPipes(pipeList); 

try { 
    FileOutputStream outFile = new FileOutputStream("pipes.ser"); 
    ObjectOutputStream oos = new ObjectOutputStream(outFile); 
    oos.writeObject(pipes); 
    oos.close(); 
} catch (FileNotFoundException ex) { 
    // handle error 
} catch (IOException ex) { 
    // handle error 
}
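The two save blocks above call close() inside the try body, so the stream stays open if writeObject throws. A minimal sketch of a generic alternative using try-with-resources follows. SerializationUtil and its method names are hypothetical helpers, not part of Mallet; the sketch relies only on the object being Serializable, which the answer states for ParallelTopicModel and uses for SerialPipes.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not a Mallet class): saves and loads any Serializable object.
final class SerializationUtil {
    private SerializationUtil() {}

    /** Writes obj to file; try-with-resources closes the stream even on failure. */
    static void save(Serializable obj, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(obj);
        }
    }

    /** Reads one object from file and casts it to the expected type. */
    static <T> T load(Path file, Class<T> type) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return type.cast(in.readObject());
        }
    }
}
```

With Mallet on the classpath, the calls would presumably look like SerializationUtil.save(model, Paths.get("model.ser")) after training and SerializationUtil.load(Paths.get("model.ser"), ParallelTopicModel.class) before inference, with the same pair of calls for the SerialPipes object.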

In order to load the model/pipeline and use them for inference, we need to deserialize them:

import cc.mallet.pipe.SerialPipes;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.topics.TopicInferencer;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;

private static void InferByModel(String sentence) {
    // define model and pipeline
    ParallelTopicModel model = null;
    SerialPipes pipes = null;

    // load the model (try-with-resources closes the stream for us)
    try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("model.ser"))) {
        model = (ParallelTopicModel) in.readObject();
    } catch (IOException ex) {
        System.out.println("Could not read model from file: " + ex);
    } catch (ClassNotFoundException ex) {
        System.out.println("Could not load the model: " + ex);
    }

    // load the pipeline
    try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("pipes.ser"))) {
        pipes = (SerialPipes) in.readObject();
    } catch (IOException ex) {
        System.out.println("Could not read pipes from file: " + ex);
    } catch (ClassNotFoundException ex) {
        System.out.println("Could not load the pipes: " + ex);
    }

    // if both are properly loaded
    if (model != null && pipes != null) {

        // Create a new instance named "test instance" with empty target
        // and source fields; note we are using the pipe list here
        InstanceList testing = new InstanceList(pipes);
        testing.addThruPipe(
            new Instance(sentence, null, "test instance", null));

        // here we get an inferencer from our loaded model and use it
        TopicInferencer inferencer = model.getInferencer();
        double[] testProbabilities = inferencer
            .getSampledDistribution(testing.get(0), 10, 1, 5);
        System.out.println("0\t" + testProbabilities[0]);
    }
}
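The method above prints only the probability of topic 0, while the returned array holds one probability per topic. As a rough sketch for reading the whole distribution, the helper below ranks topic indices by probability. TopicRanking and topTopics are hypothetical names introduced here for illustration; the code is plain Java and does not call Mallet.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper (not a Mallet class): ranks topics of an inferred distribution.
final class TopicRanking {
    private TopicRanking() {}

    /** Returns the indices of the k most probable topics, most probable first. */
    static List<Integer> topTopics(double[] probs, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < probs.length; i++) {
            ids.add(i);
        }
        // Sort topic indices by descending probability.
        ids.sort(Comparator.comparingDouble((Integer i) -> probs[i]).reversed());
        // Clamp k so that it never exceeds the number of topics or goes below zero.
        int limit = Math.max(0, Math.min(k, ids.size()));
        return new ArrayList<>(ids.subList(0, limit));
    }
}
```

For example, topTopics(new double[] {0.1, 0.6, 0.3}, 2) returns [1, 2], so passing testProbabilities with a small k would list the dominant topics of the new sentence.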

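For some reason I am not getting exactly the same inference from the loaded model as from the original one, but that is a matter for another question (if anyone knows why, though, I'd be happy to hear).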


2015-03-20 15:15:40

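To get the same result across consecutive inference runs, you just need to add inferencer.setRandomSeed(1). However, if I use the inferencer on a text document that was already used to train the model, I don't get the same topic distribution as the model has for it.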

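Thanks for the useful code snippet. In terms of performance, though, it would be better to generate the inferencer directly while training the model, rather than first loading the whole model and then getting the inferencer from it. – phly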

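@Phauly - Sometimes you want to train a model and use it again later. In that case it makes sense to train the model once and save it, so it can be loaded for inference when needed (if I understood your comment correctly).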
