2016-12-16

Answer

The difference between the two is subtle.

If you use, for example, .toDF("name", "age") to convert an unnamed tuple such as ("Pete", 22) into a DataFrame, you can also rename the columns afterwards by calling the toDF method again. For example:

scala> val rdd = sc.parallelize(List(("Piter", 22), ("Gurbe", 27))) 
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2] at parallelize at <console>:27 

scala> val df = rdd.toDF("name", "age") 
df: org.apache.spark.sql.DataFrame = [name: string, age: int] 

scala> df.show() 
+-----+---+ 
| name|age| 
+-----+---+ 
|Piter| 22| 
|Gurbe| 27| 
+-----+---+ 

scala> val df = rdd.toDF("person", "age") 
df: org.apache.spark.sql.DataFrame = [person: string, age: int] 

scala> df.show() 
+------+---+ 
|person|age| 
+------+---+ 
| Piter| 22| 
| Gurbe| 27| 
+------+---+ 
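By contrast, select does not rename columns on its own, but you can achieve the same renamed result by aliasing a column inside the select. A minimal sketch, assuming the same rdd as above and the implicit conversions that spark-shell imports for the $"..." column syntax:

```scala
// Rename while selecting, via Column.as — equivalent to the
// toDF("person", "age") call above, but done column by column.
scala> rdd.toDF("name", "age").select($"name".as("person"), $"age").show()
+------+---+
|person|age|
+------+---+
| Piter| 22|
| Gurbe| 27|
+------+---+
```

This is handy when you only want to rename one column out of many, since toDF requires you to respecify every column name.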

With select you can pick out columns, which you can later use to project the table, or to save only the columns you need:

scala> df.select("age").show() 
+---+ 
|age| 
+---+ 
| 22| 
| 27| 
+---+ 

scala> df.select("age").write.save("/tmp/ages.parquet") 
Scaling row group sizes to 88.37% for 8 writers. 

Hope this helps!
