
Apache Spark decision tree implementation problem with Java

I want to implement a simple demo of a decision tree classifier using Java and Apache Spark 1.0.0, based on http://spark.apache.org/docs/1.0.0/mllib-decision-tree.html. So far I have written the code listed below.

With the following line of code, I get an error:

org.apache.spark.mllib.tree.impurity.Impurity impurity = new org.apache.spark.mllib.tree.impurity.Entropy(); 

Type mismatch: cannot convert from Entropy to Impurity. This seems very strange to me, since the Entropy class implements the Impurity interface:

https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/mllib/tree/impurity/Entropy.html

I am looking for an answer to why I cannot make this assignment.

package decisionTree; 

import java.util.regex.Pattern; 

import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.JavaSparkContext; 
import org.apache.spark.api.java.function.Function; 
import org.apache.spark.mllib.linalg.Vectors; 
import org.apache.spark.mllib.regression.LabeledPoint; 
import org.apache.spark.mllib.tree.DecisionTree; 
import org.apache.spark.mllib.tree.configuration.Algo; 
import org.apache.spark.mllib.tree.configuration.Strategy; 
import org.apache.spark.mllib.tree.impurity.Gini; 
import org.apache.spark.mllib.tree.impurity.Impurity; 

import scala.Enumeration.Value; 

public final class DecisionTreeDemo { 

    static class ParsePoint implements Function<String, LabeledPoint> { 
     private static final Pattern COMMA = Pattern.compile(","); 
     private static final Pattern SPACE = Pattern.compile(" "); 

     @Override 
     public LabeledPoint call(String line) { 
      String[] parts = COMMA.split(line); 
      double y = Double.parseDouble(parts[0]); 
      String[] tok = SPACE.split(parts[1]); 
      double[] x = new double[tok.length]; 
      for (int i = 0; i < tok.length; ++i) { 
       x[i] = Double.parseDouble(tok[i]); 
      } 
      return new LabeledPoint(y, Vectors.dense(x)); 
     } 
    } 

    public static void main(String[] args) throws Exception { 

     if (args.length < 1) { 
      System.err.println("Usage:DecisionTreeDemo <file>"); 
      System.exit(1); 
     } 

     JavaSparkContext ctx = new JavaSparkContext("local[4]", "Log Analizer", 
       System.getenv("SPARK_HOME"), 
       JavaSparkContext.jarOfClass(DecisionTreeDemo.class)); 

     JavaRDD<String> lines = ctx.textFile(args[0]); 
     JavaRDD<LabeledPoint> points = lines.map(new ParsePoint()).cache(); 

     int iterations = 100; 

     int maxBins = 2; 
     int maxMemory = 512; 
     int maxDepth = 1; 

     org.apache.spark.mllib.tree.impurity.Impurity impurity = new org.apache.spark.mllib.tree.impurity.Entropy(); 

     Strategy strategy = new Strategy(Algo.Classification(), impurity, maxDepth, 
       maxBins, null, null, maxMemory); 

     ctx.stop(); 
    } 
} 

@samthebest If I remove the impurity variable and change the call to the following form:

Strategy strategy = new Strategy(Algo.Classification(), new org.apache.spark.mllib.tree.impurity.Entropy(), maxDepth, maxBins, null, null, maxMemory); 

I get the error: The constructor Entropy() is undefined.

[Edit] I found what I believe is the correct way to make this call (https://issues.apache.org/jira/browse/SPARK-2197):

Strategy strategy = new Strategy(Algo.Classification(), new Impurity() { 
@Override 
public double calculate(double arg0, double arg1, double arg2) 
{ return Gini.calculate(arg0, arg1, arg2); } 

@Override 
public double calculate(double arg0, double arg1) 
{ return Gini.calculate(arg0, arg1); } 

}, 5, 100, QuantileStrategy.Sort(), null, 256); 

Unfortunately, I then ran into the bug reported there :(


Odd. Try inlining it instead of assigning it to a variable; after all, you only use the variable once. I would also really recommend using the Scala API instead of the Java API: you could do the whole thing in a few lines and it would be much easier to read. – samthebest

Answer

A Java-friendly solution to bug 2197 is now available via this pull request:

Other improvements to Decision Trees for easy-of-use with Java:
* impurity classes: Added instance() methods to help with Java interface.
* Strategy: Added Java-friendly constructor --> Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently. I suspect we will redo the API before the other options are included.
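In practical terms, that change should make the assignment from the original question possible without calling a constructor at all. A minimal sketch, assuming the instance() methods described in the pull request were added to Entropy as well as to Gini:

// Hedged sketch: relies on the instance() helpers described in the pull request above. 
// Entropy.instance() is assumed to return the singleton Impurity implementation. 
Impurity impurity = org.apache.spark.mllib.tree.impurity.Entropy.instance(); 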

You can see a complete example that uses the instance() method for the Gini impurity here:

Strategy strategy = new Strategy(Algo.Classification(), Gini.instance(), maxDepth, numClasses,maxBins, categoricalFeaturesInfo); 
DecisionTreeModel model = DecisionTree$.MODULE$.train(rdd.rdd(), strategy);
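
For completeness, here is a minimal sketch of how the trained model could then be applied to the parsed data, assuming points is the JavaRDD<LabeledPoint> built by the ParsePoint function from the question; DecisionTreeModel.predict(Vector) returns the predicted label for a single feature vector:

// Hedged sketch: score the parsed points with the fitted model. 
// Assumes: import org.apache.spark.mllib.tree.model.DecisionTreeModel; 
final DecisionTreeModel trainedModel = model; 
JavaRDD<Double> predictions = points.map(new Function<LabeledPoint, Double>() { 
    @Override 
    public Double call(LabeledPoint p) { 
        // Predict the class label from the point's feature vector. 
        return trainedModel.predict(p.features()); 
    } 
}); 
System.out.println("First predictions: " + predictions.take(5)); 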