2012-02-25 88 views

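I am using Mallet's Naive Bayes algorithm to classify a large dataset. My question is how to split the dataset into training and test sets. Can anyone tell me the best way to do a train-test split? My documents are sorted by date. I found this train-test split method: Train-Test Split + Text Classification + Naive Bayes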

public Trial testTrainSplit(InstanceList instances) {

    int TRAINING = 0;
    int TESTING = 1;
    int VALIDATION = 2;

    // Split the input list into training (90%) and testing (10%) lists.
    // The division takes place by creating a copy of the list,
    // randomly shuffling the copy, and then allocating
    // instances to each sub-list based on the provided proportions.

    InstanceList[] instanceLists =
        instances.split(new Randoms(),
                        new double[] {0.9, 0.1, 0.0});

    // The third position is for the "validation" set,
    // which is a set of instances not used directly
    // for training, but available for determining
    // when to stop training and for estimating optimal
    // settings of nuisance parameters.
    // Most Mallet ClassifierTrainers can not currently take advantage
    // of validation sets.

    Classifier classifier = trainClassifier(instanceLists[TRAINING]);
    return new Trial(classifier, instanceLists[TESTING]);
}

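However, I think this is not appropriate when the documents are sorted by date, because the split shuffles the instances randomly. Can anyone help me?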

Answers
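
One possible approach, offered as a sketch rather than a definitive answer: if the documents are sorted by date, you could skip the shuffle and cut the list at a fixed position, so that the oldest 90% is used for training and the newest 10% for testing. The example below shows the idea on a plain `java.util.List`; the class name `ChronologicalSplit` and the method `split` are made up for illustration and are not part of Mallet. Mallet's `InstanceList` appears to offer an order-preserving `splitInOrder(double[])` method as well, but please check the Javadoc of your Mallet version before relying on it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mallet: splits a date-ordered list
// into an older training part and a newer testing part, without shuffling.
public class ChronologicalSplit {

    // Returns two lists: index 0 holds the first (oldest) trainFraction
    // of the items, index 1 holds the remaining (newest) items.
    // Assumes the input is already sorted by date, oldest first.
    public static <T> List<List<T>> split(List<T> sortedByDate, double trainFraction) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be in [0, 1]");
        }
        // Position of the cut between training and testing items.
        int cut = (int) Math.round(sortedByDate.size() * trainFraction);

        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(sortedByDate.subList(0, cut)));
        parts.add(new ArrayList<>(sortedByDate.subList(cut, sortedByDate.size())));
        return parts;
    }

    public static void main(String[] args) {
        // Placeholder document names standing in for date-sorted documents.
        List<String> docs = Arrays.asList(
            "doc01", "doc02", "doc03", "doc04", "doc05",
            "doc06", "doc07", "doc08", "doc09", "doc10");

        List<List<String>> parts = split(docs, 0.9);
        System.out.println("training: " + parts.get(0));
        System.out.println("testing:  " + parts.get(1));
    }
}
```

With a split like this, the classifier is evaluated only on documents newer than anything it was trained on, which is closer to how it would be used on future data. Whether 90/10 is the right proportion depends on your dataset.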