2016-10-20

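How to synchronize train and test codebooks in Accord.Net

Q: Is there a random forest example that keeps the train and test sets separate? The current example I found in the Accord-Net ML test project uses the same data for training and testing.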

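Apparently the problem I am having is synchronizing the labels (integers) generated for the test and train sets. I generate the train labels like this: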

int[] trainOutputs = trainCodebook.Translate("Output", trainLabels); 

And the test labels similarly: 

int[] testOutputs = testCodebook.Translate("Output", testLabels); 

Finally I train with the train data and test with the test data: 

var forest = teacher.Learn(trainVectors, trainOutputs); 

int[] predicted = forest.Decide(testVectors); 

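Unless the first three lines are the same in both the train and test sets, the labels are encoded differently, and this produces a very high error rate.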
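
The mismatch described above can be reproduced without Accord.Net. The following is a minimal sketch, assuming that a codebook assigns integer codes in order of first appearance (which is consistent with the behaviour reported here, but is an assumption about the library). `FirstSeenEncoder` is a hypothetical stand-in, not an Accord.Net class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a codebook: each new label gets the next
// integer code, in order of first appearance (assumed behaviour).
public class FirstSeenEncoder
{
    private readonly Dictionary<string, int> codes = new Dictionary<string, int>();

    public FirstSeenEncoder(IEnumerable<string> labels)
    {
        foreach (string label in labels)
        {
            if (!codes.ContainsKey(label))
                codes[label] = codes.Count;
        }
    }

    // Throws KeyNotFoundException for a label that was never seen.
    public int[] Translate(IEnumerable<string> labels)
    {
        return labels.Select(label => codes[label]).ToArray();
    }
}

public static class Demo
{
    public static void Main()
    {
        string[] trainLabels = { "1", "0", "-1", "0" };
        string[] testLabels = { "-1", "1", "0" };

        var trainEncoder = new FirstSeenEncoder(trainLabels);
        var testEncoder = new FirstSeenEncoder(testLabels);

        // The same test labels receive different integers depending on
        // which encoder translates them.
        Console.WriteLine(string.Join(",", trainEncoder.Translate(testLabels)));
        Console.WriteLine(string.Join(",", testEncoder.Translate(testLabels)));
    }
}
```

In this sketch the label "-1" is code 2 under the train encoder and code 0 under the test encoder, so predictions from a model trained on the first encoding would be scored against the wrong integers.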

I tried simply creating my codebook manually from the three ternary strings:

new Codification("-1","0","1"); 

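Unfortunately this produces a runtime error stating that the given key was not present in the dictionary. I am sure there is a way to synchronize key generation across two separate codebooks. I can make it work with the code below if I add three lines of my train data (containing all three keys) to the top of the test data. Not my preferred solution ;=)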

Here is the whole test I am running:

[Test] 
public void test_learn() 
{ 
    Accord.Math.Random.Generator.Seed = 1; 

    /////////// TRAINING SET /////////// 
    // First, let's load the TRAINING set into an array of text that we can process 
    string[][] text = Resources.train.Split(new[] { "\r\n" }, 
     StringSplitOptions.RemoveEmptyEntries).Apply(x => x.Split(',')); 

    int length = text[0].Length; 
    List<int> columns = new List<int>(); 
    for (int i = 1; i < length; i++) 
    { 
     columns.Add(i); 
    } 
    double[][] trainVectors = text.GetColumns(columns.ToArray()).To<double[][]>(); 

    // The first column contains the expected ternary category (i.e. -1, 0, or 1) 
    string[] trainLabels = text.GetColumn(0); 
    var trainCodebook = new Codification("Output", trainLabels); 
    int[] trainOutputs = trainCodebook.Translate("Output", trainLabels); 

    ////////// TEST SET //////////// 

    text = Resources.test.Split(new[] { "\r\n" }, 
     StringSplitOptions.RemoveEmptyEntries).Apply(x => x.Split(',')); 

    double[][] testVectors = text.GetColumns(columns.ToArray()).To<double[][]>(); 
    string[] testLabels = text.GetColumn(0); 
    var testCodebook = new Codification("Output", testLabels); 
    int[] testOutputs = testCodebook.Translate("Output", testLabels); 

    var teacher = new RandomForestLearning() 
    { 
     NumberOfTrees = 10, 
    }; 

    var forest = teacher.Learn(trainVectors, trainOutputs); 
    int[] predicted = forest.Decide(testVectors); 

    int lineNum = 1; 
    foreach (int prediction in predicted) 
    { 
     Console.WriteLine("Prediction " + lineNum + ": " 
     + trainCodebook.Translate("Output", prediction)); 
     lineNum++; 
    } 
    // I'm using the test vectors to calculate the error rate 
    double error = new ZeroOneLoss(testOutputs).Loss(forest.Decide(testVectors)); 

    Console.WriteLine("Error term is " + error); 

    Assert.IsTrue(error < 0.20); // humble expectations ;-) 
} 

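You should have **only one** codebook, created from the training set, and you should use it to preprocess the data in both the training **and** testing sets. – Cesar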
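
The suggestion in this comment might look like the sketch below. It uses only the `Codification` constructor and `Translate` overload that already appear in the question; the `SharedCodebookSketch` class and its `Encode` method are hypothetical names introduced for illustration, and the Accord.Net package is assumed to be referenced.

```csharp
using Accord.Statistics.Filters;

public static class SharedCodebookSketch
{
    // Hypothetical helper: builds one codebook from the training labels
    // and uses it to encode both the training and the test labels.
    public static void Encode(string[] trainLabels, string[] testLabels,
        out int[] trainOutputs, out int[] testOutputs)
    {
        var codebook = new Codification("Output", trainLabels);

        trainOutputs = codebook.Translate("Output", trainLabels);

        // Same codebook, so a given label maps to the same integer here.
        testOutputs = codebook.Translate("Output", testLabels);
    }
}
```

In the test from the question, this would amount to dropping `testCodebook` and calling `trainCodebook.Translate("Output", testLabels)` instead. A label that occurs only in the test set would presumably fail to translate, much like the key-not-found error described above, so the training set needs to contain every class.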

Answer


OK, I think I figured it out. The problem was a faulty implementation of serialization in DecisionTree. Fortunately we have the source code; see the patch below:

namespace Accord.MachineLearning.DecisionTrees 
{ 
    using System; 
    using System.Collections.Generic; 
    using System.Linq; 
    using System.Text; 
    using System.Threading.Tasks; 
    using System.Data; 
    using System.Runtime.Serialization; 
    using System.Runtime.Serialization.Formatters.Binary; 
    using System.IO; 
    using Accord.Statistics.Filters; 
    using Accord.Math; 
    using AForge; 
    using Accord.Statistics; 
    using System.Threading; 


/// <summary> 
/// Random Forest. 
/// </summary> 
/// 
/// <remarks> 
/// <para> 
/// Represents a random forest of <see cref="DecisionTree"/>s. For 
/// sample usage and example of learning, please see the documentation 
/// page for <see cref="RandomForestLearning"/>.</para> 
/// </remarks> 
/// 
/// <seealso cref="DecisionTree"/> 
/// <seealso cref="RandomForestLearning"/> 
/// 
[Serializable] 
public class RandomForest : MulticlassClassifierBase, IParallel 
{ 
    private DecisionTree[] trees; 
    // Patch: exclude the parallel options from serialization.
    [NonSerialized] 
    private ParallelOptions parallelOptions; 


    /// <summary> 
    /// Gets the trees in the random forest. 
    /// </summary> 
    /// 
    public DecisionTree[] Trees 
    { 
     get { return trees; } 
    } 

    /// <summary> 
    /// Gets the number of classes that can be recognized 
    /// by this random forest. 
    /// </summary> 
    /// 
    [Obsolete("Please use NumberOfOutputs instead.")] 
    public int Classes { get { return NumberOfOutputs; } } 

    /// <summary> 
    /// Gets or sets the parallelization options for this algorithm. 
    /// </summary> 
    /// 
    public ParallelOptions ParallelOptions // Patch: backed by the [NonSerialized] field
    { 
     get { return parallelOptions; } 
     set { parallelOptions = value; } 
    } 

    /// <summary> 
    /// Gets or sets a cancellation token that can be used 
    /// to cancel the algorithm while it is running. 
    /// </summary> 
    /// 
    public CancellationToken Token 
    { 
     get { return ParallelOptions.CancellationToken; } 
     set { ParallelOptions.CancellationToken = value; } 
    } 

    /// <summary> 
    /// Creates a new random forest. 
    /// </summary> 
    /// 
    /// <param name="trees">The number of trees in the forest.</param> 
    /// <param name="classes">The number of classes in the classification problem.</param> 
    /// 
    public RandomForest(int trees, int classes) 
    { 
     this.trees = new DecisionTree[trees]; 
     this.NumberOfOutputs = classes; 
     this.ParallelOptions = new ParallelOptions(); 
    } 

    /// <summary> 
    /// Computes the decision output for a given input vector. 
    /// </summary> 
    /// 
    /// <param name="data">The input vector.</param> 
    /// 
    /// <returns>The forest decision for the given vector.</returns> 
    /// 
    [Obsolete("Please use Decide() instead.")] 
    public int Compute(double[] data) 
    { 
     return Decide(data); 
    } 


    /// <summary> 
    /// Computes a class-label decision for a given <paramref name="input" />. 
    /// </summary> 
    /// <param name="input">The input vector that should be classified into 
    /// one of the <see cref="ITransform.NumberOfOutputs" /> possible classes.</param> 
    /// <returns>A class-label that best described <paramref name="input" /> according 
    /// to this classifier.</returns> 
    public override int Decide(double[] input) 
    { 
     int[] responses = new int[NumberOfOutputs]; 
     Parallel.For(0, trees.Length, ParallelOptions, i => 
     { 
      int j = trees[i].Decide(input); 
      Interlocked.Increment(ref responses[j]); 
     }); 

     return responses.ArgMax(); 
    } 

    [OnDeserializing()] 
    internal void OnDeserializingMethod(StreamingContext context) 
    { 
     this.ParallelOptions = new ParallelOptions(); 
    } 
} 
}
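
The pattern used in the patch (a `[NonSerialized]` field that is recreated in an `[OnDeserializing]` callback) can be tried in isolation. This is a minimal sketch under stated assumptions: `Worker` is a hypothetical class standing in for `RandomForest`, and `DataContractSerializer` is swapped in for `BinaryFormatter`, which is obsolete in current .NET. It is not the code path Accord.Net itself uses.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Threading.Tasks;

// Hypothetical class standing in for RandomForest.
[Serializable]
public class Worker
{
    private int[] weights;

    // Not written to the stream; must be rebuilt after loading.
    [NonSerialized]
    private ParallelOptions options;

    public Worker(int[] weights)
    {
        this.weights = weights;
        this.options = new ParallelOptions();
    }

    public int[] Weights { get { return weights; } }
    public ParallelOptions Options { get { return options; } }

    // Invoked by the serializer before the fields are populated, because
    // the constructor is not run during deserialization.
    [OnDeserializing]
    internal void OnDeserializingMethod(StreamingContext context)
    {
        this.options = new ParallelOptions();
    }
}

public static class Program
{
    // Serializes the object to memory and reads it back.
    public static Worker RoundTrip(Worker source)
    {
        var serializer = new DataContractSerializer(typeof(Worker));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, source);
            stream.Position = 0;
            return (Worker)serializer.ReadObject(stream);
        }
    }

    public static void Main()
    {
        Worker copy = RoundTrip(new Worker(new[] { 1, 2, 3 }));
        Console.WriteLine("Weights restored: " + (copy.Weights.Length == 3));
        Console.WriteLine("Options recreated: " + (copy.Options != null));
    }
}
```

Without the callback, the `options` field of the deserialized copy would be left at its default value, and any property that dereferences it (such as `Token` in the patched class) would fail.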