I wrote some Spark code and I have an RDD that looks like this, with pyspark.resultiterable.ResultIterable values:
[(4, <pyspark.resultiterable.ResultIterable at 0x9d32a4c>),
(1, <pyspark.resultiterable.ResultIterable at 0x9d32cac>),
(5, <pyspark.resultiterable.ResultIterable at 0x9d32bac>),
(2, <pyspark.resultiterable.ResultIterable at 0x9d32acc>)]
What I need to do is call distinct on each pyspark.resultiterable.ResultIterable.
I tried this:
def distinctHost(a, b):
    p = sc.parallelize(b)
    return (a, p.distinct())
mydata.map(lambda x: distinctHost(*x))
But I get an error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
The error is self-explanatory: I can't use sc there. But I need to find a way to convert the pyspark.resultiterable.ResultIterable into an RDD so that I can call the distinct method on it.
I think a set would do nicely. Thanks!
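A minimal sketch of the set idea, assuming the values in each group are hashable and that mydata holds (key, iterable) pairs as shown in the question. The helper names distinct_values and distinct_per_key are hypothetical, introduced only for illustration. Nothing here touches sc inside the transformation, so it should avoid the SPARK-5063 error:

```python
def distinct_values(values):
    """Collapse any iterable (including a ResultIterable) into a set of unique items."""
    # Plain Python that runs on the workers; no SparkContext is referenced.
    return set(values)


def distinct_per_key(grouped_rdd):
    """Return a new RDD of (key, set_of_unique_values) pairs.

    Assumes grouped_rdd holds (key, iterable) pairs, e.g. the result of groupByKey().
    """
    # mapValues keeps each key and applies the function to its value only.
    return grouped_rdd.mapValues(distinct_values)


# Hypothetical usage on the driver, with `mydata` being the RDD from the question:
#   distinct_per_key(mydata).collect()
```

If the grouping comes from groupByKey(), another option may be to de-duplicate the (key, value) pairs with distinct() before grouping, so the groups never contain duplicates in the first place.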