
While training data in Mallet, processing stops with an OutOfMemoryError. The MEMORY property in bin/mallet has been set to 3GB. The training file output.mallet is only 31 MB in size. I tried reducing the size of the training data, but it still throws the same error: Mallet: OutOfMemoryError: Java heap space
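For reference, the usual way to raise Mallet's heap is to edit the MEMORY line near the top of bin/mallet (the full script is shown further below); a minimal sketch, assuming the stock script layout:

# in bin/mallet -- this value is passed to java as -Xmx$MEMORY on the last line of the script
MEMORY=3g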

~/dev/test_models/Mallet$ bin/mallet train-classifier --input output.mallet --trainer NaiveBayes --training-portion 0.0001 --num-trials 10 
Training portion = 1.0E-4 
Unlabeled training sub-portion = 0.0 
Validation portion = 0.0 
Testing portion = 0.9999 

-------------------- Trial 0 -------------------- 

Trial 0 Training NaiveBayesTrainer with 7 instances 
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space 
     at cc.mallet.types.Multinomial$Estimator.setAlphabet(Multinomial.java:309) 
     at cc.mallet.classify.NaiveBayesTrainer.setup(NaiveBayesTrainer.java:251) 
     at cc.mallet.classify.NaiveBayesTrainer.trainIncremental(NaiveBayesTrainer.java:200) 
     at cc.mallet.classify.NaiveBayesTrainer.train(NaiveBayesTrainer.java:193) 
     at cc.mallet.classify.NaiveBayesTrainer.train(NaiveBayesTrainer.java:59) 
     at cc.mallet.classify.tui.Vectors2Classify.main(Vectors2Classify.java:415) 

I would appreciate any help or insight into this problem.

Edit: Here is my bin/mallet file.

#!/bin/bash 


malletdir=`dirname $0` 
malletdir=`dirname $malletdir` 

cp=$malletdir/class:$malletdir/lib/mallet-deps.jar:$CLASSPATH 
#echo $cp 

MEMORY=10g 

CMD=$1 
shift 

help() 
{ 
cat <<EOF 
Mallet 2.0 commands: 

    import-dir   load the contents of a directory into mallet instances (one per file) 
    import-file  load a single file into mallet instances (one per line) 
    import-svmlight load SVMLight format data files into Mallet instances 
    info    get information about Mallet instances 
    train-classifier train a classifier from Mallet data files 
    classify-dir  classify data from a single file with a saved classifier 
    classify-file  classify the contents of a directory with a saved classifier 
    classify-svmlight classify data from a single file in SVMLight format 
    train-topics  train a topic model from Mallet data files 
    infer-topics  use a trained topic model to infer topics for new documents 
    evaluate-topics estimate the probability of new documents under a trained model 
    prune    remove features based on frequency or information gain 
    split    divide data into testing, training, and validation portions 
    bulk-load   for big input files, efficiently prune vocabulary and import docs 

Include --help with any option for more information 
EOF 
} 

CLASS= 

case $CMD in 
     import-dir) CLASS=cc.mallet.classify.tui.Text2Vectors;; 
     import-file) CLASS=cc.mallet.classify.tui.Csv2Vectors;; 
     import-svmlight) CLASS=cc.mallet.classify.tui.SvmLight2Vectors;; 
     info) CLASS=cc.mallet.classify.tui.Vectors2Info;; 
     train-classifier) CLASS=cc.mallet.classify.tui.Vectors2Classify;; 
     classify-dir) CLASS=cc.mallet.classify.tui.Text2Classify;; 
     classify-file) CLASS=cc.mallet.classify.tui.Csv2Classify;; 
     classify-svmlight) CLASS=cc.mallet.classify.tui.SvmLight2Classify;; 
     train-topics) CLASS=cc.mallet.topics.tui.TopicTrainer;; 
     infer-topics) CLASS=cc.mallet.topics.tui.InferTopics;; 
     evaluate-topics) CLASS=cc.mallet.topics.tui.EvaluateTopics;; 
     prune) CLASS=cc.mallet.classify.tui.Vectors2Vectors;; 
     split) CLASS=cc.mallet.classify.tui.Vectors2Vectors;; 
     bulk-load) CLASS=cc.mallet.util.BulkLoader;; 
     run) CLASS=$1; shift;; 
     *) echo "Unrecognized command: $CMD"; help; exit 1;; 
esac 

java -Xmx$MEMORY -ea -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server -classpath "$cp" $CLASS "$@"

It is also worth mentioning that my original training file has 60,000 items. When I reduce the number of items (to 20,000 instances), training runs normally, but it uses about 10GB of RAM.
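Since the comments below ask which file was actually edited, it can help to confirm which heap size really reaches the JVM. A debugging sketch, not part of the original post (the trace also starts the training, so interrupt it with Ctrl-C once the java line has been printed):

# run the script with shell tracing and show the resolved -Xmx value
bash -x bin/mallet train-classifier --input output.mallet --trainer NaiveBayes 2>&1 | grep -- '-Xmx'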


Which file exactly did you change? bin/mallet or bin/mallet.sh? – mikep


bin/mallet and bin/mallet.bat –


Is the call to java in one of them? – mikep

Answer


Check the call to java in bin/mallet and add the flag -Xmx3g (make sure there is not another -Xmx already in it; if there is, edit that one).
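With the script shown in the question, that amounts to making sure the MEMORY value you set is the one the final java line actually receives. A sketch based on the script above (3g is just an example size):

MEMORY=3g
java -Xmx$MEMORY -ea -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server -classpath "$cp" $CLASS "$@"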