2017-04-13 70 views

Intersect with Spark and WrappedArray: I would like to do the following:

import org.apache.spark.SparkContext 
import org.apache.spark.SparkConf 
import org.apache.spark.rdd._ 
import org.apache.spark.sql.SparkSession 

val spark = SparkSession.builder().appName("SameInterest").getOrCreate() 

import spark.implicits._ // must come after `spark` exists 

spark.read.json("/path/data1").select("Name", "Interest").createOrReplaceTempView("d1_sql") 
spark.read.json("/path/data2").select("Name", "Interest").createOrReplaceTempView("d2_sql") 

val sql_script = "SELECT d1_sql.Name AS Name, d1_sql.Interest AS Interest1, d2_sql.Interest AS Interest2 FROM d1_sql, d2_sql WHERE d1_sql.Name = d2_sql.Name" 

val dosql = spark.sql(sql_script) 

val sameIP_UU = dosql.rdd.filter(X => Array(X(1)).intersect(Array(X(2))).length > 0) 

I want to intersect the Interest columns of d1 and d2, but I cannot get the correct answer.

The data and schema are:

{"name":"John","Interest1":{"bag_0":[{"Interest":"110"},{"Interest":"220"},{"Interest":"333"}]},"Interest2":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]}} 
{"name":"Allen","Interest1":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]},"Interest2":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]}} 

printSchema():

 |-- Name: string (nullable = true) 
 |-- Interest1: struct (nullable = true) 
 |    |-- bag_0: array (nullable = true) 
 |    |    |-- element: struct (containsNull = true) 
 |    |    |    |-- Interest: string (nullable = true) 
 |-- Interest2: struct (nullable = true) 
 |    |-- bag_0: array (nullable = true) 
 |    |    |-- element: struct (containsNull = true) 
 |    |    |    |-- Interest: string (nullable = true) 

I think the answer should be 2, but I always get 1.

I also found that the data structure contains a WrappedArray: [WrappedArray([110],[220],[333])]

This is probably why my answer is wrong, but I don't know how to get the values out of the WrappedArray and run intersect on them.

Edit:

dosql.take(1) 
res47: Array[org.apache.spark.sql.Row] = Array([John,[WrappedArray([110], [220], [333])],[WrappedArray([110], [220], [333])]]) 
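A minimal plain-Scala sketch (no Spark; the nested `Seq` layout and the sample values stand in for the `Row`/`WrappedArray` structure above) of why wrapping the whole column in a one-element `Array` always compares the entire columns instead of the individual Interest values:

```scala
// Model one Interest column as Spark returns it: a struct wrapping an
// array of single-field rows, approximated here as Seq[Seq[String]].
val johnInterest1 = Seq(Seq("110"), Seq("220"), Seq("333"))
val johnInterest2 = Seq(Seq("111"), Seq("222"), Seq("333"))

// The original filter wraps each whole column value in a one-element
// Array, so intersect compares the two columns as single objects:
val wholeColumn = Array(johnInterest1).intersect(Array(johnInterest2))
// wholeColumn is empty: the columns are not identical as wholes,
// even though they share the element "333".

// Element-wise intersect on the unwrapped values finds the overlap:
val elementWise = johnInterest1.flatten.intersect(johnInterest2.flatten)

println(wholeColumn.isEmpty)  // true
println(elementWise)          // List(333)
```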

Your code fails right away on `select("Name","Interest")`, because there is no `Interest` column - should it be `Interest1`? `Interest2`? And is the sample data `/path/data1` or `/path/data2`? Or both? –


Sorry, I did not explain my original data structure clearly. data1 and data2 have the same Name and Interest columns, so I can correctly select the values with `select("Name","Interest")` – chilun

Answer


It should be:

import org.apache.spark.sql.Row 

dosql.rdd.filter { row => 
    // Unwrap each struct column, then its bag_0 array of Interest rows. 
    // Note: Row.getSeq takes an Int index, not a field name; use getAs 
    // to look up a column by name first. 
    val i1 = row.getAs[Row]("Interest1").getSeq[Row](0) 
    val i2 = row.getAs[Row]("Interest2").getSeq[Row](0) 
    i1.intersect(i2).nonEmpty 
} 
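As a plain-Scala sanity check of the expected count (no Spark; names and interest values taken from the sample data above, with the struct/bag_0 nesting flattened to simple `Seq[String]`):

```scala
// One tuple per joined row: (Name, Interest1 values, Interest2 values).
val rows = Seq(
  ("John",  Seq("110", "220", "333"), Seq("111", "222", "333")),
  ("Allen", Seq("111", "222", "333"), Seq("111", "222", "333"))
)

// Keep a name when its two interest lists share at least one value.
val matches = rows.filter { case (_, i1, i2) => i1.intersect(i2).nonEmpty }

println(matches.map(_._1))  // List(John, Allen) -> count 2, as expected
```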

I tried it and still get `error: type mismatch; found: Seq[org.apache.spark.sql.Row] required: Boolean` – chilun