2016-11-27

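Apache Spark decision tree prediction

I have the following code, which performs classification with a decision tree. I need to turn the predictions for the test dataset into a Java array and print them. Could someone help me extend this code? I need a two-dimensional array of predicted labels and actual labels, and I need to print the predicted labels.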

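import java.util.HashMap;
import java.util.Map;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;
import org.apache.spark.mllib.util.MLUtils;

public class DecisionTreeClass { 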
    public static void main(String args[]){ 
     SparkConf sparkConf = new SparkConf().setAppName("DecisionTreeClass").setMaster("local[2]"); 
     JavaSparkContext jsc = new JavaSparkContext(sparkConf); 


     // Load and parse the data file. 
     String datapath = "/home/thamali/Desktop/tlib.txt"; 
     // A training example used in supervised learning is called a "labeled point" in MLlib. 
     JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), datapath).toJavaRDD(); 
     // Split the data into training and test sets (30% held out for testing) 
     JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3}); 
     JavaRDD<LabeledPoint> trainingData = splits[0]; 
     JavaRDD<LabeledPoint> testData = splits[1]; 

     // Set parameters. 
     // Empty categoricalFeaturesInfo indicates all features are continuous. 
     Integer numClasses = 12; 
     Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>(); 
     String impurity = "gini"; 
     Integer maxDepth = 5; 
     Integer maxBins = 32; 

     // Train a DecisionTree model for classification. 
     final DecisionTreeModel model = DecisionTree.trainClassifier(trainingData, numClasses, 
       categoricalFeaturesInfo, impurity, maxDepth, maxBins); 

     // Evaluate model on test instances and compute test error 
     JavaPairRDD<Double, Double> predictionAndLabel = 
       testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() { 
        @Override 
        public Tuple2<Double, Double> call(LabeledPoint p) { 
         return new Tuple2<Double, Double>(model.predict(p.features()), p.label()); 
        } 
       }); 

     Double testErr = 
       1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() { 
        @Override 
        public Boolean call(Tuple2<Double, Double> pl) { 
         return !pl._1().equals(pl._2()); 
        } 
       }).count()/testData.count(); 

     System.out.println("Test Error: " + testErr); 
     System.out.println("Learned classification tree model:\n" + model.toDebugString()); 


    } 

} 

Answer

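You basically already have this in the predictionAndLabel variable. If you really need a list of two-element double arrays, you can change the method you use to: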

JavaRDD<double[]> valuesAndPreds = testData.map(point -> new double[]{model.predict(point.features()), point.label()}); 

and then, to get the list of double arrays, call collect on that reference:

List<double[]> values = valuesAndPreds.collect(); 
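
Since the question asks for a two-dimensional array rather than a list, here is a minimal sketch of the conversion in plain Java. The class and method names (PredictionTable, toTable, column) are made up for illustration, and the hard-coded rows in main are only a stand-in for what collect would return.

```java
import java.util.ArrayList;
import java.util.List;

public class PredictionTable {

    // Turn the collected rows into a double[][]; each row is {prediction, label}.
    public static double[][] toTable(List<double[]> rows) {
        return rows.toArray(new double[0][]);
    }

    // Pull out one column: index 0 = predicted label, index 1 = actual label.
    public static double[] column(double[][] table, int index) {
        double[] out = new double[table.length];
        for (int i = 0; i < table.length; i++) {
            out[i] = table[i][index];
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for valuesAndPreds.collect().
        List<double[]> values = new ArrayList<>();
        values.add(new double[]{3.0, 3.0});
        values.add(new double[]{7.0, 5.0});

        double[][] table = toTable(values);
        double[] predicted = column(table, 0);
        System.out.println(predicted.length + " predictions, first = " + predicted[0]);
    }
}
```

Keep in mind that collect brings every row back to the driver, so this is only reasonable when the test set fits in memory.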

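I would take a look at the docs here: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html. You can also reshape the data so that classes such as MulticlassMetrics can compute additional performance metrics for your model. That requires changing the mapToPair call to a map call and changing the generic types to Object. So, something like: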

JavaRDD<Tuple2<Object, Object>> valuesAndPreds = testData.map(point -> new Tuple2<>(model.predict(point.features()), point.label())); 

Then run:

MulticlassMetrics multiclassMetrics = new MulticlassMetrics(JavaRDD.toRDD(valuesAndPreds)); 
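
For intuition about what this kind of evaluation summarises, here is a plain-Java sketch that computes accuracy and a table of confusion counts from {prediction, label} rows. It is not Spark's implementation. The class name, the method names, and the [actual][predicted] layout are my own choices, and it assumes the labels are whole numbers from 0 to numClasses - 1.

```java
public class SimpleMulticlassSummary {

    // Fraction of rows whose prediction (index 0) equals the label (index 1).
    public static double accuracy(double[][] predictionAndLabel) {
        if (predictionAndLabel.length == 0) {
            return 0.0;
        }
        int correct = 0;
        for (double[] row : predictionAndLabel) {
            if (row[0] == row[1]) {
                correct++;
            }
        }
        return (double) correct / predictionAndLabel.length;
    }

    // counts[actual][predicted]: how often each actual class received each prediction.
    // Assumes labels are integers in the range 0 .. numClasses - 1.
    public static long[][] confusion(double[][] predictionAndLabel, int numClasses) {
        long[][] counts = new long[numClasses][numClasses];
        for (double[] row : predictionAndLabel) {
            counts[(int) row[1]][(int) row[0]]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows of {prediction, label}.
        double[][] rows = {{0, 0}, {1, 1}, {2, 1}, {1, 1}};
        System.out.println("accuracy = " + accuracy(rows));
        System.out.println("actual 1 predicted as 2: " + confusion(rows, 3)[1][2]);
    }
}
```

Note that the test error printed in the question is 1 minus this accuracy.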

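All of this is well documented in Spark's MLlib documentation. You also mentioned needing to print the results. If this is homework, I will let you figure that part out, because working out how to do it from the list is a good exercise.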

Edit:

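I also noticed that you are using Java 7, while what I wrote above is Java 8. To answer your main question of how to get the results as two-element double arrays, you would do: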

JavaRDD<double[]> valuesAndPreds = testData.map(new org.apache.spark.api.java.function.Function<LabeledPoint, double[]>() { 
       @Override 
       public double[] call(LabeledPoint point) { 
        return new double[]{model.predict(point.features()), point.label()}; 
       } 
      }); 

Then run collect to get a list of two-element double arrays. As a hint for the printing part, take a look at the toString methods in java.util.Arrays.
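
As a rough sketch of that hint, assuming the rows have already been collected into a List<double[]>: the class name PrintPredictions and the helper formatRows are hypothetical, and the sample rows stand in for real predictions. It uses no lambdas, so it should compile on Java 7 as well.

```java
import java.util.Arrays;
import java.util.List;

public class PrintPredictions {

    // Render each {prediction, label} row on its own line using Arrays.toString.
    public static String formatRows(List<double[]> rows) {
        StringBuilder sb = new StringBuilder();
        for (double[] row : rows) {
            sb.append(Arrays.toString(row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the collected predictions and labels.
        List<double[]> values = Arrays.asList(
                new double[]{3.0, 3.0},
                new double[]{7.0, 5.0});
        System.out.print(formatRows(values));
    }
}
```

If the data is already held in a double[][], Arrays.deepToString prints the whole table in one call.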