VectorAssembler output only to DenseVector?

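The behavior of VectorAssembler is very annoying. I am currently transforming a set of columns into a single column of vectors and then using StandardScaler to scale the features it contains. However, it seems that Spark decides, for memory reasons, whether to use a DenseVector or a SparseVector to represent each row of features. But when you need to use StandardScaler, SparseVector input is invalid; only DenseVectors are allowed. Does anybody know a solution to this? Can VectorAssembler be made to output only DenseVectors?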

Edit: I decided to just use a UDF instead, which turns the sparse vector into a dense vector. Kind of silly, but it works.

Source

2016-03-07 ml_0x

You're right that VectorAssembler chooses a dense or sparse output format based on whichever one uses less memory.
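To illustrate the kind of memory comparison involved, here is a rough sketch in plain Python. The byte costs are assumptions (8 bytes per dense value, 12 bytes per stored sparse entry for a 4-byte index plus an 8-byte value, and small fixed overheads), and the function names are hypothetical. It is not Spark's actual code.

```python
def dense_bytes(size):
    # Assumed cost: one 8-byte double per component plus a small overhead.
    return 8 * size + 8


def sparse_bytes(nnz):
    # Assumed cost: a 4-byte index and an 8-byte value per stored entry,
    # plus a slightly larger fixed overhead.
    return 12 * nnz + 20


def cheaper_format(size, nnz):
    """Return which representation is smaller under the assumed costs."""
    return "sparse" if sparse_bytes(nnz) < dense_bytes(size) else "dense"


# A vector of length 4 with 2 non-zeros is cheaper to store densely here,
# while a vector of length 100 with 5 non-zeros is cheaper to store sparsely.
small = cheaper_format(4, 2)
large = cheaper_format(100, 5)
```

Under these assumptions the choice depends only on the vector length and the number of non-zero entries, which is why different rows of the same column can end up in different formats.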

You don't need a UDF to convert a SparseVector to a DenseVector; just use the toArray() method:

from pyspark.ml.linalg import SparseVector, DenseVector 
a = SparseVector(4, [1, 3], [3.0, 4.0]) 
b = DenseVector(a.toArray())

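Also, StandardScaler accepts SparseVector input unless you set withMean=True when creating it. If you do need to de-mean the data, a (presumably non-zero) number has to be subtracted from every component, so the sparse vector won't be sparse any more.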
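The effect of de-meaning on sparsity can be seen with a small plain-Python sketch. It only mimics the centering step on made-up data; the helper names are hypothetical and this is not how StandardScaler is implemented.

```python
def column_means(rows):
    """Mean of each feature (column) across all rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def demean(rows):
    """Subtract each column's mean from every entry in that column."""
    means = column_means(rows)
    return [[x - m for x, m in zip(row, means)] for row in rows]


def count_zeros(rows):
    """Number of entries that are exactly zero."""
    return sum(1 for row in rows for x in row if x == 0.0)


# Made-up feature rows that are mostly zeros.
rows = [
    [0.0, 3.0, 0.0, 4.0],
    [0.0, 0.0, 2.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

centered = demean(rows)
zeros_before = count_zeros(rows)
zeros_after = count_zeros(centered)
```

Every column here has a non-zero mean, so each former zero becomes the negative of that mean and there is nothing left for a sparse representation to skip.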

Source

2016-07-26 17:28:02 max

You can convert it to a dense vector after VectorAssembler has converted it to a sparse vector.

Here is what I did.

Create a case class for the DenseVector:

import org.apache.spark.mllib.linalg.Vector
case class vct(features: Vector)

Transform the sparse vector column into a dense vector column:

// select(...) leaves a single column, so it is read at index 0
val new_df = df.select("sparse vector column").map(x => { vct(x.getAs[org.apache.spark.mllib.linalg.SparseVector](0).toDense)}).toDF()

Source

2017-08-14 20:54:23
