2014-07-21

We are planning to move our Apache Pig code to the new Spark platform. How can a "cross join" be implemented in Spark?

Pig has a Bag/Tuple/Field concept and behaves much like a relational database. Pig provides support for CROSS/INNER/OUTER joins.

For a CROSS JOIN, we can use alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];
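For example, a concrete cross of two relations in Pig Latin (the relation names A, B, and C are illustrative):

-- Pair every tuple of A with every tuple of B (a Cartesian product)
C = CROSS A, B;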

However, when we move to the Spark platform, I cannot find any counterpart in the Spark API. Do you have any ideas?


It isn't ready yet, but Spork (Pig on Spark) is currently in the works, so you may not need to change any of your code – aaronman

Answers


It's oneRDD.cartesian(anotherRDD)
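A minimal sketch, assuming the Spark shell's built-in SparkContext sc and some illustrative sample data:

val fruits  = sc.parallelize(Seq("apple", "banana"))
val numbers = sc.parallelize(Seq(1, 2, 3))

// cartesian pairs every element of the first RDD with every element
// of the second, yielding an RDD[(String, Int)] of 2 * 3 = 6 tuples.
val crossed = fruits.cartesian(numbers)
crossed.collect().foreach(println)   // (apple,1), (apple,2), ..., (banana,3)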


Thanks, "Cartesian join" is another name for cross join –


Here is the recommended version for Datasets and DataFrames in Spark 2.x:

scala> val ds1 = spark.range(10) 
ds1: org.apache.spark.sql.Dataset[Long] = [id: bigint] 

scala> ds1.cache.count 
res1: Long = 10 

scala> val ds2 = spark.range(10) 
ds2: org.apache.spark.sql.Dataset[Long] = [id: bigint] 

scala> ds2.cache.count 
res2: Long = 10 

scala> val crossDS1DS2 = ds1.crossJoin(ds2) 
crossDS1DS2: org.apache.spark.sql.DataFrame = [id: bigint, id: bigint] 

scala> crossDS1DS2.count 
res3: Long = 100 
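The same product can also be written with the explicit CROSS JOIN keyword in Spark SQL, which the analyzer accepts without any configuration change. A sketch re-using ds1 and ds2 from above (the view names t1 and t2 are illustrative):

ds1.createOrReplaceTempView("t1")
ds2.createOrReplaceTempView("t2")

// The explicit CROSS JOIN keyword marks the Cartesian product as
// intentional, so no AnalysisException is raised.
val crossSQL = spark.sql("SELECT * FROM t1 CROSS JOIN t2")
crossSQL.count   // expected: 100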

Alternatively, the traditional join syntax can be used without a join condition. Unless the configuration option shown further below is set, this raises the following error:

scala> val crossDS1DS2 = ds1.join(ds2) 
crossDS1DS2: org.apache.spark.sql.DataFrame = [id: bigint, id: bigint] 

scala> crossDS1DS2.count 
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans 
... 
Join condition is missing or trivial. 
Use the CROSS JOIN syntax to allow cartesian products between these relations.; 

Related: the configuration to set when using the plain "join" syntax (the error above appears only when it is omitted):

spark.conf.set("spark.sql.crossJoin.enabled", true) 
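With that flag set, the condition-less join above completes instead of failing. A sketch re-using ds1 and ds2:

spark.conf.set("spark.sql.crossJoin.enabled", true)

// The analyzer now permits the implicit Cartesian product.
val crossDS1DS2 = ds1.join(ds2)   // no join condition
crossDS1DS2.count                 // expected: 100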

See also: spark.sql.crossJoin.enabled for Spark 2.x