2016-11-10 90 views
2

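Spark MLlib: including categorical features

What is the correct or best way to include categorical variables (both strings and integers) as features for an MLlib algorithm?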

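Is it correct to apply a OneHotEncoder to each categorical variable and then pass its output column, together with the other columns, to a VectorAssembler, as in the code below?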

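The reason I ask is that I end up with a DataFrame whose rows look like the ones below, where feature3 and feature4 combined appear to sit at the same "level" of importance as each of the two categorical features on its own.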

+------------------+---------+---------------------------------------+
|prediction        |actualVal|features                               |
+------------------+---------+---------------------------------------+
|355416.44924898935|990000.0 |(17,[0,1,2,3,4,5,10,15],[1.0,206.0])   |
|358917.32988024893|210000.0 |(17,[0,1,2,3,4,5,10,15,16],[1.0,172.0])|
|291313.84175674635|4600000.0|(17,[0,1,2,3,4,5,12,15,16],[1.0,239.0])|
+------------------+---------+---------------------------------------+

Here is my code:

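import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.types.DoubleType

// Map the string category to a numeric index.
val indexer = new StringIndexer()
    .setInputCol("stringFeatureCode")
    .setOutputCol("stringFeatureCodeIndex")
    .fit(data)
val indexed = indexer.transform(data)

// One-hot encode the indexed string category.
// (Spark 1.x/2.x API, where OneHotEncoder is a plain Transformer.)
val encoder = new OneHotEncoder()
    .setInputCol("stringFeatureCodeIndex")
    .setOutputCol("stringFeatureCodeVec")

var encoded = encoder.transform(indexed)

// Cast the integer category to Double, keeping the original column name.
encoded = encoded.withColumn("intFeatureCodeTmp", encoded.col("intFeatureCode")
    .cast(DoubleType))
    .drop("intFeatureCode")
    .withColumnRenamed("intFeatureCodeTmp", "intFeatureCode")

// One-hot encode the integer category directly, without a StringIndexer.
val intFeatureCodeEncoder = new OneHotEncoder()
    .setInputCol("intFeatureCode")
    .setOutputCol("intFeatureCodeVec")

encoded = intFeatureCodeEncoder.transform(encoded)

// Combine both one-hot vectors and the two numeric columns into one vector.
val assemblerDeparture =
    new VectorAssembler()
    .setInputCols(
     Array("stringFeatureCodeVec", "intFeatureCodeVec", "feature3", "feature4"))
    .setOutputCol("features")
val data2 = assemblerDeparture.transform(encoded)

val Array(trainingData, testData) = data2.randomSplit(Array(0.7, 0.3))

val rf = new RandomForestRegressor()
    .setLabelCol("actualVal")
    .setFeaturesCol("features")
    .setNumTrees(100)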

Answers

1
  • In general, this is the recommended approach.
  • With tree-based models it is unnecessary and should be avoided; there you can use StringIndexer alone.
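
A minimal sketch of what the second point could look like, reusing the column names from the question. It assumes the Spark ML Pipeline API and relies on VectorAssembler carrying over the nominal-attribute metadata that StringIndexer attaches to its output, which is what lets tree models treat those vector slots as categorical. The object name TreeFeaturesSketch, the helper buildPipeline, and the maxBins value of 64 are illustrative choices, not part of the original post.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.RandomForestRegressor

object TreeFeaturesSketch {
  // Hypothetical helper: builds an index-only pipeline for a tree model.
  def buildPipeline(): Pipeline = {
    // Index the string category; the output column is marked as nominal.
    val stringIndexer = new StringIndexer()
      .setInputCol("stringFeatureCode")
      .setOutputCol("stringFeatureCodeIndex")

    // StringIndexer also accepts numeric input (values are cast to string),
    // so the integer category gets a dense index and nominal metadata too.
    val intIndexer = new StringIndexer()
      .setInputCol("intFeatureCode")
      .setOutputCol("intFeatureCodeIndex")

    // Assemble the indexed columns directly, with no OneHotEncoder step.
    val assembler = new VectorAssembler()
      .setInputCols(Array(
        "stringFeatureCodeIndex", "intFeatureCodeIndex", "feature3", "feature4"))
      .setOutputCol("features")

    // maxBins must be at least the largest number of distinct categories;
    // 64 is an assumed placeholder, not a value from the question.
    val rf = new RandomForestRegressor()
      .setLabelCol("actualVal")
      .setFeaturesCol("features")
      .setNumTrees(100)
      .setMaxBins(64)

    new Pipeline().setStages(Array(stringIndexer, intIndexer, assembler, rf))
  }
}
```

Under these assumptions, the raw DataFrame would be split first and the whole pipeline fitted on the training part, for example TreeFeaturesSketch.buildPipeline().fit(trainingData). Each categorical variable then occupies a single slot in the features vector instead of one slot per category.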
+0

What does this mean? Only StringIndexer? How do you feed the indexed column to a decision tree? They take a single column of feature vectors... – rjurney