2017-01-02 29 views
0

Actually I am working with Spark 2.0.2, and I would like to understand how, for example, a logistic regression based on Spark ML works. I want to put each row of the dataframe into a vector as the input to the logistic regression. Could you help me get each row of the dataframe into a dense vector? Thank you. This is what I am doing to read the dataframe:

import org.apache.spark.ml.classification.LogisticRegression 
import org.apache.spark.ml.linalg.{Vector, Vectors} 
import org.apache.spark.ml.param.ParamMap 
import org.apache.spark.sql.SparkSession 
import org.apache.spark.sql.Row 
import org.apache.hadoop.fs.shell.Display 

object Example extends App { 
val sparkSession = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate() 
val data=sparkSession.read.option("header", "true").csv("C://sample_lda_data.csv").toDF() 
val data2=data.select("col2","col3","col4","col5","col6","col7","col8","col9") 

Here I would like something like the following as the input to the logistic regression, where the first position holds the first column of the dataframe. Any help please:

import org.apache.spark.ml.feature.VectorAssembler

val data = sparkSession.read.option("header", "true").csv("C://sample_lda_data.csv").toDF()
val data2 = data.select("col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9")
val assembler = new VectorAssembler().setInputCols(Array("col2", "col3", "col4")).setOutputCol("features")
val output = assembler.transform(data2)

This fails with: Exception in thread "main" java.lang.IllegalArgumentException: Data type StringType is not supported.

I would be very grateful for any help. Thank you guys.
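A possible reason for the StringType error above is that the csv reader loads every column as StringType by default. A minimal sketch (assuming sample_lda_data.csv really contains numeric values) is to let the reader infer the schema so the columns come back as numeric types that VectorAssembler accepts:

// sketch: ask the CSV reader to infer numeric types instead of defaulting to StringType
val data = sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("C://sample_lda_data.csv")

data.printSchema()   // columns should now show up as double/integer rather than string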

+0

You can use [VectorAssembler](https://spark.apache.org/docs/2.0.2/ml-features.html#vectorassembler). – mtoto

+0

@mtoto I used what you suggested and edited my code, but I get this error: Exception in thread "main" java.lang.IllegalArgumentException: Data type StringType is not supported. Any help? –

+1

All the cols should be numeric. – mtoto
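Following that comment, one way to get numeric columns is to cast the string columns to DoubleType before assembling them. A rough sketch, reusing the column names from the question:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// cast every feature column from StringType to DoubleType first
val featureCols = Array("col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9")
val numeric = featureCols.foldLeft(data)((df, c) => df.withColumn(c, col(c).cast(DoubleType)))

// VectorAssembler only accepts numeric, boolean or vector input columns
val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val output = assembler.transform(numeric)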

Answers

2

You can use the array function and then map into LabeledPoints:

import scala.collection.mutable

import org.apache.spark.mllib.linalg.Vectors 
import org.apache.spark.mllib.regression.LabeledPoint 
import org.apache.spark.sql._ 
import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.types.DoubleType 

// create an array column from all but first one: 
val arrayCol: Column = array(df.columns.drop(1).map(col).map(_.cast(DoubleType)): _*) 

// select array column and first column, and map into LabeledPoints 
val result: Dataset[LabeledPoint] = df.select(col("col1").cast(DoubleType), arrayCol) 
    .map(r => LabeledPoint(
    r.getAs[Double](0), 
    Vectors.dense(r.getAs[mutable.WrappedArray[Double]](1).toArray) 
)) 

// You can use the Dataset or the RDD 
result.show() 
// +-----+---------------------+ 
// |label|features    | 
// +-----+---------------------+ 
// |1.0 |[2.0,3.0,4.0,0.5] | 
// |11.0 |[12.0,13.0,14.0,15.0]| 
// |21.0 |[22.0,23.0,24.0,25.0]| 
// +-----+---------------------+ 

result.rdd.foreach(println) 
// (1.0,[2.0,3.0,4.0,0.5]) 
// (21.0,[22.0,23.0,24.0,25.0]) 
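Since the question targets the ml (rather than mllib) LogisticRegression, here is a separate sketch of the same idea using org.apache.spark.ml.feature.LabeledPoint and ml.linalg.Vectors, assuming the first column of df is the label and every column can be cast to Double:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

import sparkSession.implicits._

// label = first column, features = dense vector of the remaining columns
val training = df
  .select(df.columns.map(c => col(c).cast(DoubleType)): _*)
  .map(r => LabeledPoint(
    r.getDouble(0),
    Vectors.dense(r.toSeq.drop(1).map(_.asInstanceOf[Double]).toArray)))

// ml LogisticRegression reads the "label" and "features" columns by default
val model = new LogisticRegression().setMaxIter(10).fit(training)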
+0

Could you please tell me which imports/packages you used? Thanks for your help, I will try your code again. Thanks. –

+0

Sorry my friend, I am new to Scala and Spark; I get an error telling me that $ is not a member of StringContext. Thanks in advance. –

+0

Oh, that is another missing import ('import sparkSession.implicits._'); add it, or replace '$"col1"' with 'col("col1")'. –
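For reference, a tiny sketch of the two equivalent column references mentioned in that comment (assuming the SparkSession value is named sparkSession, as in the question):

// the $-interpolator needs the session's implicits in scope
import sparkSession.implicits._
val labels1 = df.select($"col1")

// col() comes from org.apache.spark.sql.functions and needs no implicits
import org.apache.spark.sql.functions.col
val labels2 = df.select(col("col1"))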

0
I have written code to convert a dataframe's numeric columns into a dense vector. Please find the code below. Note: here col1 and col2 are numeric-type columns.

import sparkSession.implicits._

val result: Dataset[LabeledPoint] = df.map { x =>
  LabeledPoint(x.getAs[Integer]("col1").toDouble, Vectors.dense(x.getAs[Double]("col2")))
}
result.show()
result.printSchema()

+-------+----------+ 
| label| features| 
+-------+----------+ 
|31825.0| [75000.0]| 
|58784.0| [24044.0]| 
| 121.0| [41000.0]| 

root 
|-- label: double (nullable = true) 
|-- features: vector (nullable = true)
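
If several numeric columns should go into the features vector, the same map can pass them all to Vectors.dense. A small sketch reusing the mllib imports from the first answer, where the extra column names col3 and col4 are only placeholders:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Dataset
import sparkSession.implicits._

val result: Dataset[LabeledPoint] = df.map { x =>
  LabeledPoint(
    x.getAs[Integer]("col1").toDouble,
    // Vectors.dense takes a varargs list of doubles, one per feature column
    Vectors.dense(x.getAs[Double]("col2"), x.getAs[Double]("col3"), x.getAs[Double]("col4"))
  )
}
result.show()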