How do I merge multiple columns into a single vector-valued column in Spark?

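I'm just getting started with Spark's MLlib. I want to train a simple model (for example, logistic regression). I expected that I would need to "tell" the model which column to use as the target and which columns to treat as features.

However, it looks like there should be only one feature column (a column that contains vectors as values).

So my question is: how do I build such a vector-valued column? I tried the following (but it doesn't work):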

df = df.withColumn('feat_vec', [df['_c0'], df['_c1'], df['_c1'], df['_c3'], df['_c4']])

ADDED

I also tried this:

from pyspark.ml.feature import VectorAssembler 
assembler = VectorAssembler(inputCols=['_c0', '_c1', '_c2', '_c3', '_c4'], outputCol='feat_vec') 
df = assembler.transform(df)

As a result, I get the following error message:

pyspark.sql.utils.IllegalArgumentException: u'Data type StringType is not supported.'

Source

2017-06-14 Roman

I think you've got it wrong. Have a look [here](https://stackoverflow.com/questions/32982425/encode-and-assemble-multiple-features-in-pyspark). –

Check my answer on VectorAssembler here: https://stackoverflow.com/questions/43355341/spark-pipeline-error/43378263#43378263 – TDrabas

I'm not sure that's the issue here @TDrabas – eliasah

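Using VectorAssembler is the way to go. A linalg.Vector can hold only Double values, so string columns cannot be assembled directly. For categorical strings you need to add StringIndexer + OneHotEncoder to a Pipeline. Then you can use the assembler on the newly generated columns.

E.g. (from the link)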

from pyspark.ml.feature import OneHotEncoder, StringIndexer 

df = spark.createDataFrame([ 
    (0, "a"), 
    (1, "b"), 
    (2, "c"), 
    (3, "a"), 
    (4, "a"), 
    (5, "c") 
], ["id", "category"]) 

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex") 
model = stringIndexer.fit(df) 
indexed = model.transform(df) 

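encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
# In Spark 3.0+ OneHotEncoder is an Estimator and must be fitted first.
# In Spark 2.x it was a Transformer, so encoder.transform(indexed) was called directly.
encoded = encoder.fit(indexed).transform(indexed)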
encoded.show()

P.S. Take a look at Pipelines.

Source

2017-06-14 16:09:34 Gevorg

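I've learned something useful from your answer (basically how to do one-hot encoding in Spark), but it doesn't answer my question. I don't have categorical features. The features I have are numeric (although they are represented as strings). – Roman

Maybe I misunderstood the question. But if your features are numeric and merely stored as strings, can't you cast them to Double before passing them to VectorAssembler? Could you add some sample data to the question? – Gevorg

You're right. That's why VectorAssembler didn't work. First, I didn't know the values were strings. Second, I didn't know they had to be doubles or floats. – Roman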
