2016-10-13 34 views
2

我正在尋找一種方法來按鍵組合兩個RDD。如何正確使用PySpark完成兩個RDD的外部連接?

考慮:

x = sc.parallelize([('_guid_YWKnKkcrg_Ej0icb07bhd-mXPjw-FcPi764RRhVrOxE=', 'FR', '75001'), 
       ('_guid_XblBPCaB8qx9SK3D4HuAZwO-1cuBPc1GgfgNUC2PYm4=', 'TN', '8160'), 
       ] 
      ) 
y = sc.parallelize([('_guid_oX6Lu2xxHtA_T93sK6igyW5RaHH1tAsWcF0RpNx_kUQ=', 'JmJCFu3N'), 
       ('_guid_hG88Yt5EUsqT8a06Cy380ga3XHPwaFylNyuvvqDslCw=', 'KNPQLQth'), 
       ('_guid_YWKnKkcrg_Ej0icb07bhd-mXPjw-FcPi764RRhVrOxE=', 'KlGZj08d'), 
       ] 
      ) 

我找到了解決辦法!儘管如此,這個解決方案並不完全符合我想要做的。 我創建了一個功能,以指定我的鑰匙將其應用到我的名爲 「X」 的RDD:

[('_guid_YWKnKkcrg_Ej0icb07bhd-mXPjw-FcPi764RRhVrOxE=', ('FR', '75001')), 
('_guid_XblBPCaB8qx9SK3D4HuAZwO-1cuBPc1GgfgNUC2PYm4=', ('TN', '8160'))] 

然後:

def get_keys(rdd): 

    new_x = rdd.map(lambda item: (item[0], (item[1], item[2]))) 
    return new_x 

new_x = get_keys(x) 

這給

new_x.union(y).map(lambda (x, y): (x, [y])).reduceByKey(lambda p, q : p + q).collect() 

的結果:

[('_guid_oX6Lu2xxHtA_T93sK6igyW5RaHH1tAsWcF0RpNx_kUQ=', ['JmJCFu3N']), 
('_guid_YWKnKkcrg_Ej0icb07bhd-mXPjw-FcPi764RRhVrOxE=', [('FR', '75001'), 'KlGZj08d']), 
('_guid_XblBPCaB8qx9SK3D4HuAZwO-1cuBPc1GgfgNUC2PYm4=', [('TN', '8160')]), 
('_guid_hG88Yt5EUsqT8a06Cy380ga3XHPwaFylNyuvvqDslCw=', ['KNPQLQth'])] 

我想要的是:

[('_guid_oX6Lu2xxHtA_T93sK6igyW5RaHH1tAsWcF0RpNx_kUQ=', (None, None, 'JmJCFu3N')), 
('_guid_YWKnKkcrg_Ej0icb07bhd-mXPjw-FcPi764RRhVrOxE=', ('FR', '75001', 'KlGZj08d')), 
('_guid_XblBPCaB8qx9SK3D4HuAZwO-1cuBPc1GgfgNUC2PYm4=', ('TN', '8160', None)), 
('_guid_hG88Yt5EUsqT8a06Cy380ga3XHPwaFylNyuvvqDslCw=', (None, None, 'KNPQLQth'))] 

幫助?

回答

2

爲什麼不呢?

>>> new_x.fullOuterJoin(y) 

>>> x.toDF().join(y.toDF(), ["_1"], "fullouter").rdd 
+0

LostInOverflow:。x.toDF()加入(y.toDF(),[ 「_1」], 「fullouter」)rdd.collect()給出:[行(_1 = u'_guid_YWKnKkcrg_Ej0icb07bhd-mXPjw-FcPi764RRhVrOxE =',_2 = u'FR',_3 = u'75001',_2 = u'K1GZj08d')]。 new_x.fullOuterJoin(y)更好。 – DataAddicted

+0

[( '_guid_oX6Lu2xxHtA_T93sK6igyW5RaHH1tAsWcF0RpNx_kUQ =',(無, 'JmJCFu3N')), ( '_guid_YWKnKkcrg_Ej0icb07bhd-mXPjw-FcPi764RRhVrOxE =', >>> new_x.fullOuterJoin(Y)==>(( 'FR' , '75001'), 'KlGZj08d')), ( '_guid_XblBPCaB8qx9SK3D4HuAZwO-1cuBPc1GgfgNUC2PYm4 =', (( 'TN', '8160'),無)), ( '_guid_hG88Yt5EUsqT8a06Cy380ga​​3XHPwaFylNyuvvqDslCw =',(無, 'KNPQLQth' ))] – DataAddicted

+0

在(('FR','75001'),'KlGZj08d')中有「FR」和「75001」分開的方法嗎? – DataAddicted

相關問題