Combining two different rows in a Python Spark RDD

I've run into a small problem while working with a Python Spark RDD. My RDD looks like this:

old_rdd = [(A1, Vector(V1)), (A2, Vector(V2)), (A3, Vector(V3)), ....].

I want to use flatMap to get a new RDD like:

new_rdd = [((A1, A2), (V1, V2)), ((A1, A3), (V1, V3))] and so on.

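The problem is that flatMap strips the tuples, giving something like [(A1, V1, A2, V2)...]. Do you have any other suggestions, with or without flatMap()? Thanks in advance.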
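The flattening described here can be reproduced without a cluster. The sketch below is only an illustration: `flat_map` is a hypothetical local stand-in that mimics how flatMap splices each returned iterable into the result (it is not part of the Spark API), and the labels `"A3"` and `"V3"` are placeholder values. It suggests that returning a bare tuple loses one level of nesting, while wrapping the tuple in a one-element list keeps it.

```python
from itertools import chain


def flat_map(func, items):
    # Hypothetical stand-in for flatMap semantics: apply func to every item,
    # then splice each returned iterable into a single flat list.
    return list(chain.from_iterable(func(item) for item in items))


rows = [("A1", "V1"), ("A2", "V2")]

# Returning the pair tuple directly: its two elements are spliced apart.
flattened = flat_map(lambda row: ((row[0], "A3"), (row[1], "V3")), rows)

# Wrapping the pair tuple in a one-element list keeps the tuple intact.
preserved = flat_map(lambda row: [((row[0], "A3"), (row[1], "V3"))], rows)

print(flattened)
print(preserved)
```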


2015-11-13 Aarav

What is the pattern? All combinations? Permutations? Or only some pairs? –

They are combinations – Aarav

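This is related to "Explicit sort in Cartesian transformation in Scala Spark". I'll assume you have already removed duplicates from the RDD, and that the ids follow some simple pattern that can be parsed and then ordered. For simplicity, I'll use lists instead of Vectors.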

old_rdd = sc.parallelize([(1, [1, -2]), (2, [5, 7]), (3, [8, 23]), (4, [-1, 90])]) 

# cartesian() yields every ordered pair, including each element paired with itself.
# Combinations are a subset of those pairs, so we filter on the ids.
combined_rdd = old_rdd.cartesian(old_rdd)
combinations = combined_rdd.filter(lambda pair: pair[0][0] < pair[1][0])

combinations.collect() 

# The output will be... 
# ----------------------------- 
# [((1, [1, -2]), (2, [5, 7])), 
# ((1, [1, -2]), (3, [8, 23])), 
# ((1, [1, -2]), (4, [-1, 90])), 
# ((2, [5, 7]), (3, [8, 23])), 
# ((2, [5, 7]), (4, [-1, 90])), 
# ((3, [8, 23]), (4, [-1, 90]))] 

# Now reshape each pair into the layout you want: ((id1, id2), (values1, values2))
combinations = combinations.map(lambda pair: ((pair[0][0], pair[1][0]), (pair[0][1], pair[1][1]))).collect()

# The output will be... 
# ---------------------- 
# [((1, 2), ([1, -2], [5, 7])), 
# ((1, 3), ([1, -2], [8, 23])), 
# ((1, 4), ([1, -2], [-1, 90])), 
# ((2, 3), ([5, 7], [8, 23])), 
# ((2, 4), ([5, 7], [-1, 90])), 
# ((3, 4), ([8, 23], [-1, 90]))]
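For a quick sanity check on a small sample without Spark, the same pairs can be built in plain Python. This is a local sketch, not a replacement for the RDD code above: it assumes the data fits in memory and that the ids are unique and sortable, and the names `old_rows` and `pairs` are introduced here only for illustration.

```python
from itertools import combinations

# Same sample data as the RDD example, held in an ordinary list.
old_rows = [(1, [1, -2]), (2, [5, 7]), (3, [8, 23]), (4, [-1, 90])]

# Sort by id so each combination comes out with the smaller id first,
# then reshape every pair into ((id1, id2), (values1, values2)).
sorted_rows = sorted(old_rows, key=lambda row: row[0])
pairs = [
    ((id1, id2), (values1, values2))
    for (id1, values1), (id2, values2) in combinations(sorted_rows, 2)
]

for pair in pairs:
    print(pair)
```

Comparing this list with the collected Spark result (after sorting both, since collect() order may vary) is one way to confirm the filter and map steps behave as intended.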


2015-11-13 20:08:13

Alberto, thank you very much, you made my day. – Aarav
