2017-04-03 82 views
4

我定義了兩個表是這樣的:Left反加入Spark?

val tableName = "table1" 
    val tableName2 = "table2" 

    val format = new SimpleDateFormat("yyyy-MM-dd") 
     val data = List(
     List("mike", 26, true), 
     List("susan", 26, false), 
     List("john", 33, true) 
    ) 
    val data2 = List(
     List("mike", "grade1", 45, "baseball", new java.sql.Date(format.parse("1957-12-10").getTime)), 
     List("john", "grade2", 33, "soccer", new java.sql.Date(format.parse("1978-06-07").getTime)), 
     List("john", "grade2", 32, "golf", new java.sql.Date(format.parse("1978-06-07").getTime)), 
     List("mike", "grade2", 26, "basketball", new java.sql.Date(format.parse("1978-06-07").getTime)), 
     List("lena", "grade2", 23, "baseball", new java.sql.Date(format.parse("1978-06-07").getTime)) 
    ) 

     val rdd = sparkContext.parallelize(data).map(Row.fromSeq(_)) 
     val rdd2 = sparkContext.parallelize(data2).map(Row.fromSeq(_)) 
     val schema = StructType(Array(
     StructField("name", StringType, true), 
     StructField("age", IntegerType, true), 
     StructField("isBoy", BooleanType, false) 
    )) 
    val schema2 = StructType(Array(
     StructField("name", StringType, true), 
     StructField("grade", StringType, true), 
     StructField("howold", IntegerType, true), 
     StructField("hobby", StringType, true), 
     StructField("birthday", DateType, false) 
    )) 

     val df = sqlContext.createDataFrame(rdd, schema) 
     val df2 = sqlContext.createDataFrame(rdd2, schema2) 
     df.createOrReplaceTempView(tableName) 
     df2.createOrReplaceTempView(tableName2) 

我試圖建立查詢到從沒有匹配的行表2表1返回行。 我嘗試使用此查詢做到這一點:

Select * from table1 LEFT JOIN table2 ON table1.name = table2.name AND table1.age = table2.howold AND table2.name IS NULL AND table2.howold IS NULL 

但這只是給了我從表1中的所有行:

列表({「名」:「約翰」,「年齡」: 33,isBoy:true}, {「name」:「susan」,「age」:26,「isBoy」:false}, {「name」:「mike」,「age」:26,「isBoy 「:true})

如何在Spark中有效地進行這種連接?

我正在查找SQL查詢,因爲我需要能夠指定要在兩個表之間進行比較的列,而不僅僅是比較像其他推薦問題那樣逐行比較。像使用減法,除了等。

+1

的可能的複製[火花:減去兩個DataFrames](http://stackoverflow.com/questions/29537564/spark-subtract-two-dataframes) –

+0

根據您的編輯和評論我的答案,我認爲你正在尋找: http://stackoverflow.com/questions/29537564/spark-subtract-two-dataframes值得注意的是@Interfector對第一個回答的評論 –

回答

10

您可以使用「左反」連接類型 - 無論是與數據幀API或SQL(數據幀API支持SQL支持一切,包括任何連接,你需要條件):

DataFrame API:

df.as("table1").join(
    df2.as("table2"), 
    $"table1.name" === $"table2.name" && $"table1.age" === $"table2.howold", 
    "leftanti" 
) 

SQL:

sqlContext.sql(
    """SELECT table1.* FROM table1 
    | LEFT ANTI JOIN table2 
    | ON table1.name = table2.name AND table1.age = table2.howold 
    """.stripMargin) 

:它也值得注意的是,有而不單獨指定模式產生的採樣數據,使用元組和隱式toDF方法的更短,更簡潔的方式,然後「固定」,其中所需的自動-推斷出的模式:

val df = List(
    ("mike", 26, true), 
    ("susan", 26, false), 
    ("john", 33, true) 
).toDF("name", "age", "isBoy") 

val df2 = List(
    ("mike", "grade1", 45, "baseball", new java.sql.Date(format.parse("1957-12-10").getTime)), 
    ("john", "grade2", 33, "soccer", new java.sql.Date(format.parse("1978-06-07").getTime)), 
    ("john", "grade2", 32, "golf", new java.sql.Date(format.parse("1978-06-07").getTime)), 
    ("mike", "grade2", 26, "basketball", new java.sql.Date(format.parse("1978-06-07").getTime)), 
    ("lena", "grade2", 23, "baseball", new java.sql.Date(format.parse("1978-06-07").getTime)) 
).toDF("name", "grade", "howold", "hobby", "birthday").withColumn("birthday", $"birthday".cast(DateType)) 
0

你可以使用內置函數except (我會用你提供的代碼,但你沒有包括進口,所以我不能只是c/p它:()

val a = sc.parallelize(Seq((1,"a",123),(2,"b",456))).toDF("col1","col2","col3") 
val b= sc.parallelize(Seq((4,"a",432),(2,"t",431),(2,"b",456))).toDF("col1","col2","col3") 

scala> a.show() 
+----+----+----+ 
|col1|col2|col3| 
+----+----+----+ 
| 1| a| 123| 
| 2| b| 456| 
+----+----+----+ 


scala> b.show() 
+----+----+----+ 
|col1|col2|col3| 
+----+----+----+ 
| 4| a| 432| 
| 2| t| 431| 
| 2| b| 456| 
+----+----+----+ 

scala> a.except(b).show() 
+----+----+----+ 
|col1|col2|col3| 
+----+----+----+ 
| 1| a| 123| 
+----+----+----+ 
+1

我正在查找SQL查詢,因爲我需要能夠指定要在兩個表之間進行比較的列,而不僅僅是逐行比較 – user2975535