2017-09-24 118 views

Scala-Spark: Convert a DataFrame to RDD[Edge]

I have a DataFrame that represents the edges of a graph. This is its schema:

root
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- relationship: struct (nullable = false)
 |    |-- business_id: string (nullable = true)
 |    |-- normalized_influence: double (nullable = true)

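I want to convert it to an RDD[Edge] so that I can use the Pregel API. The part I am struggling with is the "relationship" attribute. How can I convert it?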

Answer


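Edge is a parameterized class. This means that, in addition to the source and destination vertex IDs, you can store anything you like on each edge. In your case, that could be an Edge[Relationship]. You can use case classes to map between the DataFrame and an RDD[Edge[Relationship]]: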

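import org.apache.spark.graphx.Edge
import org.apache.spark.rdd.RDD
import scala.util.hashing.MurmurHash3
import spark.implicits._ // assumes a SparkSession named `spark`; provides the encoder for df.as[MyEdge]

// Case classes mirroring the DataFrame schema (field names must match the column names)
case class Relationship(business_id: String, normalized_influence: Double)
case class MyEdge(src: String, dst: String, relationship: Relationship)

val edges: RDD[Edge[Relationship]] = df.as[MyEdge].rdd.map { edge =>
  Edge(
    MurmurHash3.stringHash(edge.src).toLong, // VertexId is a Long, so the string ID has to be hashed
    MurmurHash3.stringHash(edge.dst).toLong,
    edge.relationship
  )
}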
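
One caveat with hashing string IDs: MurmurHash3.stringHash returns a 32-bit Int, so two different strings can in principle end up with the same VertexId. Below is a minimal sketch, using only the Scala standard library, of how you might check a collection of vertex names for such collisions before building the graph. The names VertexIds, toVertexId and collisions are hypothetical helpers introduced for illustration; they are not part of GraphX.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper object, not part of Spark or GraphX.
object VertexIds {
  // Same hashing as in the answer above: a 32-bit hash widened to a Long.
  def toVertexId(name: String): Long = MurmurHash3.stringHash(name).toLong

  // Groups the distinct names by their hashed ID and keeps only the IDs
  // that are shared by more than one name.
  def collisions(names: Iterable[String]): Map[Long, Seq[String]] =
    names.toSeq.distinct
      .groupBy(toVertexId)
      .filter { case (_, group) => group.size > 1 }
}
```

If collisions returns an empty map for all of your src and dst values, the hashed IDs are safe to use for that dataset. Otherwise, one option would be to assign IDs explicitly, for example with zipWithUniqueId on the distinct vertex names, and join them back onto the edges.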