Spark with WrappedArray: I'd like to perform the following intersect operation
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SameInterest").getOrCreate()
import spark.implicits._  // must come after `spark` is defined

val d1 = spark.read.json("/path/data1").select("Name", "Interest").createOrReplaceTempView("d1_sql")
val d2 = spark.read.json("/path/data2").select("Name", "Interest").createOrReplaceTempView("d2_sql")
val sql_script = "SELECT d1_sql.Name AS Name, d1_sql.Interest AS Interest1, d2_sql.Interest AS Interest2 FROM d1_sql, d2_sql WHERE d1_sql.Name = d2_sql.Name"
val dosql = spark.sql(sql_script)
val sameIP_UU = dosql.rdd.filter(X => Array(X(1)).intersect(Array(X(2))).length > 0)
I want to intersect the Interest columns of d1 and d2, but I can't get the correct answer.
The data and schema are:
{"name":"John","Interest1":{"bag_0":[{"Interest":"110"},{"Interest":"220"},{"Interest":"333"}]},"Interest2":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]}}
{"name":"Allen","Interest1":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]},"Interest2":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]}}
printSchema():
|-- Name: string (nullable = true)
|-- Interest1: struct (nullable = true)
| |-- bag_0: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Interest: string (nullable = true)
|-- Interest2: struct (nullable = true)
| |-- bag_0: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Interest: string (nullable = true)
I think the answer must be 2, but I always get 1.
I also found that the data structure contains a WrappedArray: [WrappedArray([110],[220],[333])]
This is probably why my answer is wrong, but I don't know how to get the values out of the WrappedArray and use intersect on them.
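The filter above wraps each whole column value in an Array before calling intersect, so the two nested collections are compared as single elements rather than element by element; only rows whose lists are identical survive, which would explain a count of 1 instead of 2. A minimal plain-Scala sketch of the difference (the string values are stand-ins for the bag_0 contents):

```scala
// Stand-ins for the two Interest lists that Spark returns as WrappedArray
val interest1 = Seq("110", "220", "333")
val interest2 = Seq("111", "222", "333")

// Wrapping each whole list in an Array makes intersect compare the lists
// as single elements, so the result is empty unless the lists are identical:
val wholeListMatch = Array(interest1).intersect(Array(interest2))  // -> empty

// Intersecting the lists themselves compares element by element:
val shared = interest1.intersect(interest2)  // -> List("333")
```

With this element-wise intersect, John's row ({110,220,333} vs {111,222,333}) shares "333" and Allen's row matches fully, so both rows would pass the filter.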
Edit:
dosql.take(1)
res47: Array[org.apache.spark.sql.Row] = Array([John,[WrappedArray([110], [220], [333])],[WrappedArray([110], [220], [333])]])
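The take(1) output shows that each Interest column is a struct holding a WrappedArray of single-field rows, so the strings sit two levels down. A plain-Scala sketch of unwrapping that shape and applying the intended element-wise filter (the nested Seqs are illustrative stand-ins for the Row/WrappedArray layers):

```scala
// Stand-in for one joined row: (Name, Interest1.bag_0, Interest2.bag_0),
// where each bag_0 element is a one-field record holding an Interest string
val row = ("John",
           Seq(Seq("110"), Seq("220"), Seq("333")),
           Seq(Seq("111"), Seq("222"), Seq("333")))

// Unwrap each bag_0 element to its single Interest value, then intersect
val left  = row._2.map(_.head)              // Seq("110", "220", "333")
val right = row._3.map(_.head)              // Seq("111", "222", "333")
val keep  = left.intersect(right).nonEmpty  // true: "333" is shared
```

With real Spark rows the same unwrapping would go through accessors such as Row.getStruct and getSeq[Row], though the exact calls depend on the schema.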
Your code fails immediately at select("Name", "Interest") because there is no 'Interest' column; should it be 'Interest1'? 'Interest2'? And is the sample data '/path/data1' or '/path/data2'? Or both? –
Sorry, I didn't explain my original data structure clearly. data1 and data2 both have the same columns, Name and Interest, so I can select the values correctly with select("Name", "Interest"). – chilun