如何使用Spark-Scala中定義的模式創建行？

我想創建一個具有案例類模式的行來測試我的一個地圖函數。我能想到這樣做的最直接的方法是：如何使用Spark-Scala中定義的模式創建行？

import org.apache.spark.sql.Row 

case class MyCaseClass(foo: String, bar: Option[String]) 

def buildRowWithSchema(record: MyCaseClass): Row = { 
    sparkSession.createDataFrame(Seq(record)).collect.head 
}

然而，這似乎是一個很大的開銷，只是得到一個單列的，所以我看着我怎麼能直接創建一排的模式。這導致我：

import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema 
import org.apache.spark.sql.{Encoders, Row} 

def buildRowWithSchemaV2(record: MyCaseClass): Row = { 
    val recordValues: Array[Any] = record.getClass.getDeclaredFields.map((field) => { 
     field.setAccessible(true) 
     field.get(record) 
    }) 
    new GenericRowWithSchema(recordValues, Encoders.product[MyCaseClass].schema) 
}

不幸的是，第二個版本返回的行與第一行不同。第一個版本中的選項字段被縮減爲原始值，而它們仍然是第二個版本中的選項。另外，第二個版本相當笨拙。

有沒有更好的方法來做到這一點？

來源

2017-05-05 Rahul Gupta-Iwasaki

第二個版本返回Option本身的bar大小寫字段，因此您沒有得到原始值作爲第一個版本。您可以使用下面的代碼爲原始值

def buildRowWithSchemaV2(record: MyCaseClass): Row = { 
    val recordValues: Array[Any] = record.getClass.getDeclaredFields.map((field) => { 
    field.setAccessible(true) 
    val returnValue = field.get(record) 
    if(returnValue.isInstanceOf[Option[String]]){ 
     returnValue.asInstanceOf[Option[String]].get 
    } 
    else 
     returnValue 
    }) 
    new GenericRowWithSchema(recordValues, Encoders.product[MyCaseClass].schema) 
}

但同時，我會建議你使用DataFrame或DataSet。

DataFrame和DataSet本身是Row with schema的集合。
所以，當你有一個case class定義的，你只需要輸入數據encode到case class 例如：可以說你有輸入數據作爲

val data = Seq(("test1", "value1"),("test2", "value2"),("test3", "value3"),("test4", null))

如果你有一個文本文件，你可以閱讀sparkContext.textFile和split根據您的需要。
現在，當你已經將您的數據RDD，將其轉換爲dataframe或dataset是兩行代碼

import sqlContext.implicits._ 
val dataFrame = data.map(d => MyCaseClass(d._1, Option(d._2))).toDF

.toDS會產生dataset 因此，你有Rows with schema
集合驗證你可以做以下

println(dataFrame.schema) //for checking if there is schema 

println(dataFrame.take(1).getClass.getName) //for checking if it is a collection of Rows

希望你有正確的答案。

來源

2017-05-06 06:57:17

'take（1）'雖然比'collect.head'更高效，但沒有多大區別。我相信OP想要完全跳過RDD的創建。 –

@HristoIliev，我只是爲了驗證/顯示'dataframe'或'dataset'是'schema'的'Row'的集合。 –

但是OP需要一個'Row'，而不是'DataSet'或'DataFrame'。現在，我不太清楚爲什麼，因爲他可以使用單個元素輕鬆測試分佈式集合上的映射函數。 –

如何使用Spark-Scala中定義的模式創建行？

回答

相關問題