2017-01-02 29 views
0

Actually I am working with Spark 2.0.2, and I would like to understand how, for example, a logistic regression based on Spark ML works. I want to put each row of the dataframe into a vector as the input to the logistic regression. Could you help me get each row of the dataframe into a dense vector? Thank you. This is what I am doing to read the dataframe:

import org.apache.spark.ml.classification.LogisticRegression 
import org.apache.spark.ml.linalg.{Vector, Vectors} 
import org.apache.spark.ml.param.ParamMap 
import org.apache.spark.sql.SparkSession 
import org.apache.spark.sql.Row 
import org.apache.hadoop.fs.shell.Display 

object Example extends App { 
val sparkSession = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate() 
val data=sparkSession.read.option("header", "true").csv("C://sample_lda_data.csv").toDF() 
val data2=data.select("col2","col3","col4","col5","col6","col7","col8","col9") 

Here I would like something like the following as the input to the logistic regression, where the first position holds the first column of the dataframe. Any help please:

import org.apache.spark.ml.feature.VectorAssembler

val data = sparkSession.read.option("header", "true").csv("C://sample_lda_data.csv").toDF()
val data2 = data.select("col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9")
val assembler = new VectorAssembler().setInputCols(Array("col2", "col3", "col4")).setOutputCol("features")
val output = assembler.transform(data2)

This fails with: Exception in thread "main" java.lang.IllegalArgumentException: Data type StringType is not supported.

I would be very grateful for any help. Thank you guys.
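A possible reason for the StringType error above is that the csv reader loads every column as StringType by default. A minimal sketch (assuming sample_lda_data.csv really contains numeric values) is to let the reader infer the schema so the columns come back as numeric types that VectorAssembler accepts:

// sketch: ask the CSV reader to infer numeric types instead of defaulting to StringType
val data = sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("C://sample_lda_data.csv")

data.printSchema()   // columns should now show up as double/integer rather than string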

+0

You can use [VectorAssembler](https://spark.apache.org/docs/2.0.2/ml-features.html#vectorassembler). – mtoto

+0

@mtoto I used what you suggested and edited my code, but I get this error: Exception in thread "main" java.lang.IllegalArgumentException: Data type StringType is not supported. Any help? –

+1

All the cols should be numeric. – mtoto
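Following that comment, one way to get numeric columns is to cast the string columns to DoubleType before assembling them. A rough sketch, reusing the column names from the question:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// cast every feature column from StringType to DoubleType first
val featureCols = Array("col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9")
val numeric = featureCols.foldLeft(data)((df, c) => df.withColumn(c, col(c).cast(DoubleType)))

// VectorAssembler only accepts numeric, boolean or vector input columns
val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val output = assembler.transform(numeric)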

Answers

2

You can use the array function and then map into LabeledPoints:

import scala.collection.mutable

import org.apache.spark.mllib.linalg.Vectors 
import org.apache.spark.mllib.regression.LabeledPoint 
import org.apache.spark.sql._ 
import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.types.DoubleType 

// create an array column from all but first one: 
val arrayCol: Column = array(df.columns.drop(1).map(col).map(_.cast(DoubleType)): _*) 

// select array column and first column, and map into LabeledPoints 
val result: Dataset[LabeledPoint] = df.select(col("col1").cast(DoubleType), arrayCol) 
    .map(r => LabeledPoint(
    r.getAs[Double](0), 
    Vectors.dense(r.getAs[mutable.WrappedArray[Double]](1).toArray) 
)) 

// You can use the Dataset or the RDD 
result.show() 
// +-----+---------------------+ 
// |label|features    | 
// +-----+---------------------+ 
// |1.0 |[2.0,3.0,4.0,0.5] | 
// |11.0 |[12.0,13.0,14.0,15.0]| 
// |21.0 |[22.0,23.0,24.0,25.0]| 
// +-----+---------------------+ 

result.rdd.foreach(println) 
// (1.0,[2.0,3.0,4.0,0.5]) 
// (21.0,[22.0,23.0,24.0,25.0]) 
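Since the question targets the ml (rather than mllib) LogisticRegression, here is a separate sketch of the same idea using org.apache.spark.ml.feature.LabeledPoint and ml.linalg.Vectors, assuming the first column of df is the label and every column can be cast to Double:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

import sparkSession.implicits._

// label = first column, features = dense vector of the remaining columns
val training = df
  .select(df.columns.map(c => col(c).cast(DoubleType)): _*)
  .map(r => LabeledPoint(
    r.getDouble(0),
    Vectors.dense(r.toSeq.drop(1).map(_.asInstanceOf[Double]).toArray)))

// ml LogisticRegression reads the "label" and "features" columns by default
val model = new LogisticRegression().setMaxIter(10).fit(training)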
+0

Could you please tell me which imports/packages you used? Thanks for your help, I will try your code again. Thanks. –

+0

Sorry my friend, I am new to Scala and Spark; I get an error telling me that $ is not a member of StringContext. Thanks in advance. –

+0

Oh, that is another missing import ('import sparkSession.implicits._'); add it, or replace '$"col1"' with 'col("col1")'. –
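For reference, a tiny sketch of the two equivalent column references mentioned in that comment (assuming the SparkSession value is named sparkSession, as in the question):

// the $-interpolator needs the session's implicits in scope
import sparkSession.implicits._
val labels1 = df.select($"col1")

// col() comes from org.apache.spark.sql.functions and needs no implicits
import org.apache.spark.sql.functions.col
val labels2 = df.select(col("col1"))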

0
I have written code to convert a dataframe's numeric columns into a dense vector. Please find the code below. Note: here col1 and col2 are numeric-type columns.

import sparkSession.implicits._

val result: Dataset[LabeledPoint] = df.map { x =>
  LabeledPoint(x.getAs[Integer]("col1").toDouble, Vectors.dense(x.getAs[Double]("col2")))
}
result.show()
result.printSchema()

+-------+----------+ 
| label| features| 
+-------+----------+ 
|31825.0| [75000.0]| 
|58784.0| [24044.0]| 
| 121.0| [41000.0]| 

root 
|-- label: double (nullable = true) 
|-- features: vector (nullable = true)
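
If several numeric columns should go into the features vector, the same map can pass them all to Vectors.dense. A small sketch reusing the mllib imports from the first answer, where the extra column names col3 and col4 are only placeholders:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Dataset
import sparkSession.implicits._

val result: Dataset[LabeledPoint] = df.map { x =>
  LabeledPoint(
    x.getAs[Integer]("col1").toDouble,
    // Vectors.dense takes a varargs list of doubles, one per feature column
    Vectors.dense(x.getAs[Double]("col2"), x.getAs[Double]("col3"), x.getAs[Double]("col4"))
  )
}
result.show()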