2017-06-01 67 views
0

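Creating a composite transformer in Spark: I am using an NGramTransformer followed by a CountVectorizerModel.

I need to be able to create a composite transformer for later use.

I was able to do this by building a List<Transformer> and looping over all of its elements, but I would like to know whether it is possible to create a single Transformer out of two other Transformers.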

+1

You can use the Pipeline API from Spark ML – eliasah

+1

I'll take a look, thanks – LonsomeHell

+0

Here is the official mllib/ml documentation – eliasah

Answers

2

This is actually quite easy: you just need to use the Pipeline API to build your pipeline. First, some imports and sample data:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.NGram;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Sample rows of (id, sentence).
List<Row> data = Arrays.asList(
      RowFactory.create(0, "Hi I heard about Spark"),
      RowFactory.create(1, "I wish Java could use case classes"),
      RowFactory.create(2, "Logistic,regression,models,are,neat")
    );

// Schema matching the rows above.
StructType schema = new StructType(new StructField[]{
      new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
      new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});

// Assumes an existing SparkSession named spark.
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

Now let's define the pipeline stages (a tokenizer, an NGram transformer, and a count vectorizer):

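// Read from the "sentence" column defined in the schema.
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");

// Build bigrams (n = 2) from the tokenized words.
NGram ngramTransformer = new NGram().setN(2).setInputCol("words").setOutputCol("ngrams");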

CountVectorizer countVectorizer = new CountVectorizer() 
    .setInputCol("ngrams") 
    .setOutputCol("feature") 
    .setVocabSize(3) 
    .setMinDF(2); 
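
To see what setMinDF(2) asks for, here is a rough plain-Java sketch of a document-frequency filter applied to bigrams of the three sample sentences. It is only an illustration, not Spark's implementation: the class MinDfSketch and its methods are hypothetical names, and the tokenization (lowercase, split on whitespace) is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class for illustration; not part of Spark.
class MinDfSketch {

    // Lowercase, split on whitespace, then join each pair of adjacent tokens.
    static List<String> bigrams(String sentence) {
        String[] tokens = sentence.toLowerCase(Locale.ROOT).split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            result.add(tokens[i] + " " + tokens[i + 1]);
        }
        return result;
    }

    // Keep only the bigrams that occur in at least minDf distinct sentences.
    static List<String> vocabulary(List<String> sentences, int minDf) {
        Map<String, Integer> documentFrequency = new LinkedHashMap<>();
        for (String sentence : sentences) {
            // Count each bigram at most once per sentence.
            Set<String> distinct = new LinkedHashSet<>(bigrams(sentence));
            for (String bigram : distinct) {
                documentFrequency.merge(bigram, 1, Integer::sum);
            }
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            if (entry.getValue() >= minDf) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
            "Hi I heard about Spark",
            "I wish Java could use case classes",
            "Logistic,regression,models,are,neat");
        System.out.println(vocabulary(sentences, 1).size()); // prints 10
        System.out.println(vocabulary(sentences, 2));        // prints []
    }
}
```

In this sketch no bigram appears in two of the three sentences, so a threshold of 2 keeps nothing. If Spark filters in the same way, you may need a lower minDF or more training rows for the sample data to produce a usable vocabulary.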

Now we can create the pipeline and train it:

Pipeline pipeline = new Pipeline() 
      .setStages(new PipelineStage[]{tokenizer, ngramTransformer, countVectorizer}); 

// Fit the pipeline to training documents. 
PipelineModel model = pipeline.fit(sentenceDataFrame); 
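
The fitted PipelineModel is itself a Transformer, so it plays the role of the single composite the question asks for. If it helps to see the idea without Spark, here is a minimal plain-Java sketch of composing two stages into one. The class CompositeSketch and the stage names are hypothetical and only mimic a tokenizer and a bigram step; they are not Spark classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical illustration of a composite built from two stages; not Spark code.
class CompositeSketch {

    // Stage 1: lowercase the sentence and split it on whitespace.
    static final Function<String, List<String>> TOKENIZE =
        sentence -> Arrays.asList(sentence.toLowerCase(Locale.ROOT).split("\\s+"));

    // Stage 2: join each pair of adjacent words into a bigram.
    static final Function<List<String>, List<String>> BIGRAMS = words -> {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            result.add(words.get(i) + " " + words.get(i + 1));
        }
        return result;
    };

    // The composite: a single function that applies both stages in order.
    static final Function<String, List<String>> COMPOSITE = TOKENIZE.andThen(BIGRAMS);

    public static void main(String[] args) {
        // prints [hi i, i heard, heard about, about spark]
        System.out.println(COMPOSITE.apply("Hi I heard about Spark"));
    }
}
```

Callers only ever see COMPOSITE, which is the same convenience the pipeline gives you compared with looping over a List<Transformer> by hand.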

I hope this helps.