Why doesn't StandardScaler attach metadata to its output column?
I have noticed that the ml StandardScaler does not attach metadata to its output column:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._
val df = spark.read.option("header", true)
.option("inferSchema", true)
.csv("/path/to/cars.data")
val strId1 = new StringIndexer()
.setInputCol("v7")
.setOutputCol("v7_IDX")
val strId2 = new StringIndexer()
.setInputCol("v8")
.setOutputCol("v8_IDX")
val assmbleFeatures: VectorAssembler = new VectorAssembler()
.setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7_IDX"))
.setOutputCol("featuresRaw")
val scalerModel = new StandardScaler()
.setInputCol("featuresRaw")
.setOutputCol("scaledFeatures")
val plm = new Pipeline()
.setStages(Array(strId1, strId2, assmbleFeatures, scalerModel))
.fit(df)
val dft = plm.transform(df)
dft.schema("scaledFeatures").metadata
which gives:
res1: org.apache.spark.sql.types.Metadata = {}
This example works on this dataset (just adapt the paths in the code above).
Is there a specific reason for this? Is this feature likely to be added to Spark in the future? Any suggestions for a workaround that does not amount to duplicating the StandardScaler?
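One possible workaround (a sketch, not something stated in the question): the assembler's output column `featuresRaw` does carry per-slot attribute metadata produced by VectorAssembler, and scaling does not reorder or rename the vector slots, so that metadata can be copied onto the scaled column with Spark SQL's `Column.as(alias, metadata)` overload. Assuming `dft` is the transformed DataFrame from the pipeline above:

```scala
import org.apache.spark.sql.functions.col

// VectorAssembler attaches attribute metadata to its output column,
// but StandardScaler's output column has empty metadata. Since the
// scaled vector has the same slots in the same order, copy it over:
val meta = dft.schema("featuresRaw").metadata
val dftWithMeta = dft.withColumn(
  "scaledFeatures",
  col("scaledFeatures").as("scaledFeatures", meta))

// dftWithMeta.schema("scaledFeatures").metadata should now mirror
// the metadata on featuresRaw
```

This avoids re-running the scaler; it only re-aliases the existing column with the metadata from its input.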
That's a good point I hadn't considered - thank you! – aMKa