Here is the recommended way for Spark 2.x Datasets and DataFrames:
scala> val ds1 = spark.range(10)
ds1: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> ds1.cache.count
res1: Long = 10
scala> val ds2 = spark.range(10)
ds2: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> ds2.cache.count
res2: Long = 10
scala> val crossDS1DS2 = ds1.crossJoin(ds2)
crossDS1DS2: org.apache.spark.sql.DataFrame = [id: bigint, id: bigint]
scala> crossDS1DS2.count
res3: Long = 100
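For reference, the same cartesian product can also be written with the explicit SQL CROSS JOIN syntax that the error message further below refers to. A minimal sketch; the temp view names t1 and t2 are just placeholders:

// Register the Datasets as temp views so they can be queried with SQL.
ds1.createOrReplaceTempView("t1")
ds2.createOrReplaceTempView("t2")
// The explicit CROSS JOIN keyword allows the cartesian product without extra configuration.
val crossSql = spark.sql("SELECT * FROM t1 CROSS JOIN t2")
crossSql.count()  // expected: 100 (10 x 10)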
Or, you can use the traditional join syntax with no join condition. Set the configuration option shown further below to avoid the error that follows.
scala> val crossDS1DS2 = ds1.join(ds2)
crossDS1DS2: org.apache.spark.sql.DataFrame = [id: bigint, id: bigint]
scala> crossDS1DS2.count
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
...
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
The error above is raised when that configuration is left at its default and the plain join syntax is used; to allow cartesian products, set:
spark.conf.set("spark.sql.crossJoin.enabled", true)
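Once that flag is set, the unconditioned join from the failing example above should go through. A quick sketch of the expected behaviour, reusing the same ds1 and ds2:

// With spark.sql.crossJoin.enabled = true, the plan is no longer rejected
// as an implicit cartesian product.
val plainJoin = ds1.join(ds2)   // still no join condition
plainJoin.count()               // expected: 100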
Related: spark.sql.crossJoin.enabled for Spark 2.x
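If you build the SparkSession yourself rather than relying on the spark-shell, the same option can also be set once at session construction. A sketch; the app name is just a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cross-join-demo")                      // hypothetical app name
  .config("spark.sql.crossJoin.enabled", "true")   // enable cartesian products session-wide
  .getOrCreate()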