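Spark DataFrame to RDD conversion takes a long time

I'm reading a JSON file of a social network into Spark. From it I get a DataFrame, which I explode to obtain follower/account pairs. This part works perfectly. Later I want to convert the result to an RDD (for use with GraphX), but creating the RDD takes a very long time.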
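import org.apache.spark.graphx.VertexId
import org.apache.spark.sql.functions.{explode, lit}
import spark.implicits._ // already in scope in spark-shell; needed for $"..." and .as[...]

val social_network = spark.read.json("my/path") // 200MB
// One row per (follower, account) pair, with both ids cast to long
val exploded_network = social_network.
  withColumn("follower", explode($"followers")).
  withColumn("id_follower", ($"follower").cast("long")).
  withColumn("id_account", ($"account").cast("long")).
  withColumn("relationship", lit(1)).
  select("id_follower", "id_account", "relationship")
val E1 = exploded_network.as[(VertexId, VertexId, Int)] // typed Dataset
val E2 = E1.rdd // RDD of tuples for GraphX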
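
For context, this is roughly how the RDD is meant to feed GraphX. It is only a minimal sketch: the three hard-coded triples stand in for E2, and the names `edges` and `graph` and the default vertex attribute `0` are placeholders, not part of my actual job.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("edges-to-graphx").getOrCreate()

// Stand-in for E2: (id_follower, id_account, relationship) triples.
val E2: RDD[(VertexId, VertexId, Int)] =
  spark.sparkContext.parallelize(Seq((1L, 2L, 1), (3L, 2L, 1), (1L, 3L, 1)))

// Wrap each triple in a GraphX Edge (source = follower, destination = account).
val edges: RDD[Edge[Int]] = E2.map { case (src, dst, rel) => Edge(src, dst, rel) }

// Build the graph; vertices are derived from the edge endpoints and all
// receive the same default attribute (0 here, a placeholder).
val graph: Graph[Int, Int] = Graph.fromEdges(edges, 0)

println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")
```

Since GraphX only accepts RDDs, the `.rdd` step cannot simply be skipped, which is why its cost matters here.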
To check how the process runs, I ran a count at each step:
scala> exploded_network.count
res0: Long = 18205814 // 3 seconds
scala> E1.count
res1: Long = 18205814 // 3 seconds
scala> E2.count // 5.4 minutes
res2: Long = 18205814
Why is the RDD conversion about 100x slower?
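
One thing I can try in order to narrow this down is to time the same count at three levels: the DataFrame, the underlying RDD of internal rows, and the RDD of deserialized rows. My assumption, which I have not verified on the real data, is that `count` on a DataFrame runs as an optimized aggregate that does not need to build a JVM object per row, while `.rdd` has to deserialize every row. The sketch below uses a small synthetic table instead of my JSON file; the `timed` helper and the name `df` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode, lit}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("count-timing").getOrCreate()

// Synthetic stand-in for exploded_network: two followers per account.
val df = spark.range(1000).select(
  explode(array(col("id"), col("id") + 1)).as("id_follower"),
  col("id").as("id_account"),
  lit(1).as("relationship"))

// Hypothetical helper: runs `body` once and prints the wall-clock time.
def timed[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(f"$label%-26s ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

// Physical plan for the DataFrame itself, as a reference point.
df.explain()

val n1 = timed("DataFrame count")(df.count)                       // planned as an aggregate
val n2 = timed("internal-row RDD count")(df.queryExecution.toRdd.count) // no deserialization to Row
val n3 = timed("Row RDD count")(df.rdd.count)                     // deserializes every row
```

If the gap only shows up at the last step on the real data, that would point at per-row deserialization rather than the explode itself.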