I want to perform a join between these two PySpark DataFrames (joining PySpark DataFrames on a nested field):
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row  # Row import added; it is used below
from pyspark.sql.functions import col

sc = SparkContext()
sqlContext = SQLContext(sc)  # needed so that rdd.toDF() works

df1 = sc.parallelize([
    ['owner1', 'obj1', 0.5],
    ['owner1', 'obj1', 0.2],
    ['owner2', 'obj2', 0.1]
]).toDF(('owner', 'object', 'score'))

df2 = sc.parallelize(
    [Row(owner=u'owner1',
         objects=[Row(name=u'obj1', value=Row(fav=True, ratio=0.3))])]).toDF()
The join must be on objects.name, i.e. on the name field inside objects for df2 and on object for df1.
I am able to run a SELECT on the nested field, such as
df2.where(df2.owner == 'owner1').select(col("objects.value.ratio")).show()
but I am not able to run this join:
df2.alias('u').join(df1.alias('s'), col('u.objects.name') == col('s.object'))
which returns the error:
pyspark.sql.utils.AnalysisException: u"cannot resolve '(objects.name = cast(object as double))' due to data type mismatch: '(objects.name = cast(object as double))' (array and double);"
Any ideas how to solve this?