Here is how to do what you want, but in Scala; hopefully you can convert it to a PySpark example:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ParquetAppendMode")
  .getOrCreate()

import spark.implicits._

val df1 = spark.sparkContext.parallelize(Seq(
  (1, "abc"),
  (2, "def"),
  (3, "hij")
)).toDF("id", "name")

val df2 = spark.sparkContext.parallelize(Seq(
  (19, "x"),
  (29, "y"),
  (39, "z")
)).toDF("age", "address")

// Combine both schemas, then zip the two underlying RDDs row by row.
val schema = StructType(df1.schema.fields ++ df2.schema.fields)
val df1df2 = df1.rdd.zip(df2.rdd).map {
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)
}
spark.createDataFrame(df1df2, schema).show()
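Since the question asks for PySpark, here is a minimal sketch of the same zip-based approach translated to PySpark; it assumes the same local session and toy data as the Scala snippet above.

# Minimal PySpark sketch of the zip-based approach above.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = (SparkSession.builder
         .master("local")
         .appName("ParquetAppendMode")
         .getOrCreate())

df1 = spark.createDataFrame([(1, "abc"), (2, "def"), (3, "hij")], ["id", "name"])
df2 = spark.createDataFrame([(19, "x"), (29, "y"), (39, "z")], ["age", "address"])

# Combine the two schemas, then zip the underlying RDDs row by row.
schema = StructType(df1.schema.fields + df2.schema.fields)
zipped = df1.rdd.zip(df2.rdd).map(lambda pair: pair[0] + pair[1])  # Row + Row -> tuple
spark.createDataFrame(zipped, schema).show()

Note that RDD.zip requires both RDDs to have the same number of partitions and the same number of elements per partition, which holds here because both DataFrames were created the same way.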
Here is how to do it using only DataFrames:
import org.apache.spark.sql.functions._

// Tag each row with a synthetic id, join on it, then drop it.
val ddf1 = df1.withColumn("row_id", monotonically_increasing_id())
val ddf2 = df2.withColumn("row_id", monotonically_increasing_id())
val result = ddf1.join(ddf2, Seq("row_id")).drop("row_id")
result.show()
We add a new row_id column to each DataFrame and then join the two DataFrames using row_id as the key.
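In PySpark, a sketch of the same DataFrame-only approach might look like this (assuming the df1 and df2 from the PySpark snippet above):

from pyspark.sql.functions import monotonically_increasing_id

# Tag each row with a synthetic id, join on it, then drop it.
ddf1 = df1.withColumn("row_id", monotonically_increasing_id())
ddf2 = df2.withColumn("row_id", monotonically_increasing_id())
result = ddf1.join(ddf2, "row_id").drop("row_id")
result.show()

One caveat: monotonically_increasing_id() only guarantees unique, increasing ids, so the ids on the two DataFrames only line up when both have the same partitioning, as they do in this local example.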
Hope this helps!
Does this help? –
Since DataFrames are faster than RDDs, can we do this using only Spark DataFrames? –
Is this union column-wise or row-wise? –