Here is what you can do; here is a simple example:
import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
(29,"City 2", 72),
(28,"City 3", 48),
(28,"City 2", 19),
(27,"City 2", 16),
(28,"City 1", 84),
(28,"City 4", 72),
(29,"City 4", 39),
(27,"City 3", 42),
(26,"City 3", 68),
(27,"City 1", 89),
(27,"City 4", 104),
(26,"City 2", 19),
(29,"City 3", 27)
)).toDF("week", "city", "sale")
//create a dataframe with dummy data
//get list of cities
val city = data.select("city").distinct.collect().flatMap(_.toSeq)
// get all the rows for each city
//this returns Array[(Any, DataFrame)] as (cityName, DataFrame)
val result = city.map(c => c -> data.where($"city" === c))
//print all the dataframes
result.foreach(a => {
  println(s"Dataframe with ${a._1}")
  a._2.show()
})
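If you want to grab a single city's DataFrame afterwards rather than printing them all, result can be turned into a map for lookup (a small sketch using only standard Scala collections on top of the code above):

// result is an Array[(Any, DataFrame)], so .toMap gives a Map[Any, DataFrame]
val byCity = result.toMap
// look up and show one city's rows by key
byCity("City 2").show()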
The output looks like this:
Dataframe with City 1
+----+------+----+
|week|  city|sale|
+----+------+----+
|  28|City 1|  84|
|  27|City 1|  89|
+----+------+----+
Dataframe with City 3
+----+------+----+
|week|  city|sale|
+----+------+----+
|  28|City 3|  48|
|  27|City 3|  42|
|  26|City 3|  68|
|  29|City 3|  27|
+----+------+----+
Dataframe with City 4
+----+------+----+
|week|  city|sale|
+----+------+----+
|  28|City 4|  72|
|  29|City 4|  39|
|  27|City 4| 104|
+----+------+----+
Dataframe with City 2
+----+------+----+
|week|  city|sale|
+----+------+----+
|  29|City 2|  72|
|  28|City 2|  19|
|  27|City 2|  16|
|  26|City 2|  19|
+----+------+----+
You can also use partitionBy to group the data while writing it to the output:
dataframe.write.partitionBy("col").saveAsTable("output_table")
This groups the stored data by "col", creating one partition per distinct value. (Note that saveAsTable expects a table name; to write straight to a filesystem path, use a format writer such as .parquet("outputpath") instead.) Hope this helps you get grouped output per file!
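Here is a minimal sketch of that approach applied to the example data above (the path /tmp/sales_by_city is just an illustrative choice, not from the original answer):

// write the example `data` DataFrame to disk, partitioned by city
data.write
  .partitionBy("city")
  .mode("overwrite")
  .parquet("/tmp/sales_by_city")

// Spark lays out one subdirectory per distinct city value, e.g.:
//   /tmp/sales_by_city/city=City 1/part-...parquet
//   /tmp/sales_by_city/city=City 2/part-...parquet

// reading the root back and filtering on the partition column prunes
// the scan to just that directory
val city1 = spark.read.parquet("/tmp/sales_by_city").where($"city" === "City 1")
city1.show()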
Thanks - this is perfect. The suggested duplicate also answered my question, but I've accepted your answer. – thebigdog
Thanks a lot for accepting, @thebigdog :) –