I want to filter the rows of an RDD source: how can I use a NOT IN clause in a filter condition in Spark?
val source = sql("SELECT * from sample.source").rdd.map(_.mkString(","))
val destination = sql("select * from sample.destination").rdd.map(_.mkString(","))
val source_primary_key = source.map(rec => (rec.split(",")(0)))
val destination_primary_key = destination.map(rec => (rec.split(",")(0)))
// subtractByKey needs pair RDDs; for plain RDD[String] keys, subtract does the set difference
val src = source_primary_key.subtract(destination_primary_key)
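I want to use an IN clause in the filter condition to pick out of source the rows whose key is present in src, something like the following (edited):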
val source = spark.read.csv(inputPath + "/source").rdd.map(_.mkString(","))
val destination = spark.read.csv(inputPath + "/destination").rdd.map(_.mkString(","))
val source_primary_key = source.map(rec => (rec.split(",")(0)))
val destination_primary_key = destination.map(rec => (rec.split(",")(0)))
val extra_in_source = source_primary_key.filter(rec._1 != destination_primary_key._1)
The equivalent SQL would be:
SELECT * FROM SOURCE WHERE ID IN (select ID from src)
Thanks.
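What is the type of your values? – eliasah
The data type may vary: sometimes INT and sometimes String. – Vignesh
That's not what I asked. What is the type of 'src' or 'source'? Are you using an RDD or a DataFrame? – eliasah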
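Assuming 'source' and 'src' are both RDD[String] as in the question's code, with the key in the first comma-separated field, one possible sketch is to key the source lines by id and join them against the wanted ids. The names KeyFilterSketch, keepMatching and dropMatching are hypothetical, and the sample data is made up for illustration. This is not necessarily the only or the best way to do it.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object KeyFilterSketch {
  // IN: keep the source lines whose first field appears in `keys`,
  // roughly WHERE ID IN (SELECT ID FROM src).
  def keepMatching(source: RDD[String], keys: RDD[String]): RDD[String] = {
    val keyed  = source.map(line => (line.split(",")(0), line)) // (id, full line)
    val wanted = keys.distinct().map(k => (k, ()))              // dedupe so the join adds no duplicates
    keyed.join(wanted).map { case (_, (line, _)) => line }
  }

  // NOT IN: keep the source lines whose first field does NOT appear in `keys`.
  def dropMatching(source: RDD[String], keys: RDD[String]): RDD[String] = {
    val keyed = source.map(line => (line.split(",")(0), line))
    keyed.subtractByKey(keys.map(k => (k, ()))).values
  }

  def main(args: Array[String]): Unit = {
    // Local session and toy data, for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("key-filter-sketch").getOrCreate()
    val sc = spark.sparkContext
    val source      = sc.parallelize(Seq("1,a", "2,b", "3,c"))
    val destination = sc.parallelize(Seq("2,x"))
    // Keys present in source but missing from destination.
    val src = source.map(_.split(",")(0)).subtract(destination.map(_.split(",")(0)))
    // Prints the lines for ids 1 and 3; the order is not guaranteed.
    keepMatching(source, src).collect().foreach(println)
    spark.stop()
  }
}
```

If the data were kept as DataFrames instead of being flattened to strings, a join on the id column with join type "left_semi" (for IN) or "left_anti" (for NOT IN) should express the same thing without splitting on commas.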