Converting an RDD to a DataFrame

Hi, I'm new to Spark and I'm trying to convert an RDD to a DataFrame. The RDD is read from a folder that contains many .txt files, and each file holds a passage of text. Assume my RDD is this:

val data = sc.textFile("data")

I want to convert the data into a DataFrame like this:

+------------+----+
|text        |code|
+------------+----+
|data of txt1|1.0 |
|data of txt2|1.0 |
+------------+----+

So the column "text" should hold the raw data of each txt file, and the column "code" should be 1.0. Any help would be appreciated.

Source

2016-01-28 luis

Have you even tried looking at the documentation? – Niemand

val data = sc.textFile("data.txt")

// The schema is encoded in a string
val schemaString = "text code"

// Import Row
import org.apache.spark.sql.Row

// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Generate the schema from the schema string (every column is a nullable string)
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (data) to Rows; this assumes each line has the form "text,code"
val rowRDD = data.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD
val dataDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrame as a table
dataDataFrame.registerTempTable("data")

// SQL statements can be run by using the sql method provided by sqlContext
val results = sqlContext.sql("SELECT text FROM data")

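Adding the data from all files at once is not a good idea, because all of the data will be loaded into memory. Reading one file at a time would be a better approach.

However, depending on your use case, if you need the data from all the files, you will need to append the RDDs in some way.

Hope this answers your question! Cheers! :)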
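One possible way to append the RDDs is SparkContext.union, which combines a sequence of RDDs into a single one. The snippet below is only a sketch: the helper name readAll and the file names in the usage line are made up for illustration, and sc is assumed to be the SparkContext that spark-shell provides.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: read each path into its own RDD of lines,
// then append all of them into one RDD with SparkContext.union.
def readAll(sc: SparkContext, paths: Seq[String]): RDD[String] = {
  val perFile: Seq[RDD[String]] = paths.map(p => sc.textFile(p))
  sc.union(perFile)
}
```

Usage, with placeholder file names: val all = readAll(sc, Seq("data/txt1.txt", "data/txt2.txt")). The result still has one record per line, so it does not by itself tell you which file a line came from.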

Source

2016-01-28 11:29:41 Kaushal

Thanks for the edit, @Sumit – Kaushal

With Spark SQL you can use the "toDF" method:

http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection

In your case, do this:

case class Data(text: String, code: Float)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// sc.textFile yields one String per line, so each line becomes one row with code fixed at 1.0
val data = sc.textFile("data")
val dataFrame = data.map(line => Data(line, 1.0f)).toDF()
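
Since the question asks for one row per txt file, while sc.textFile produces one record per line, a variant based on sc.wholeTextFiles may be closer to what is wanted. It returns (path, content) pairs, one per file. This is a sketch rather than part of the original answer: the names FileRow and filesToDataFrame are invented for illustration, and it assumes each file is small enough to be held in memory as a single record.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical row type: the whole content of one file plus a constant code.
case class FileRow(text: String, code: Double)

// Hypothetical helper: build a DataFrame with one row per file under `dir`.
def filesToDataFrame(sc: SparkContext, sqlContext: SQLContext, dir: String): DataFrame = {
  // Brings the RDD-to-DataFrame conversion (toDF) into scope.
  import sqlContext.implicits._
  sc.wholeTextFiles(dir)                                // RDD[(path, content)]
    .map { case (_, content) => FileRow(content, 1.0) } // drop the path, fix code at 1.0
    .toDF()
}
```

Usage: val df = filesToDataFrame(sc, sqlContext, "data") followed by df.show() should display the "text" and "code" columns, with one row for each file in the folder.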

Source

2016-01-29 12:55:43 rhernando
