Efficiently computing connected components in pyspark

I'm trying to find the connected components of friends within a city. My data is an edge list with a city attribute.

City    | SRC | DEST

Houston | Kyle -> Benny

Houston | Benny -> Charles

Houston | Charles -> Denny

Omaha   | Carol -> Brian

etc.

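I know that the connectedComponents function of pyspark's GraphX library will iterate over all the edges of the graph to find the connected components, and I'd like to avoid that. How would I do so?

Edit: I thought maybe I could do something like this:

select connected_components(*)
from dataframe
group by city

where connected_components generates a list of items.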
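To make the idea in the edit concrete, here is a minimal plain-Python sketch of what such a `connected_components` function could look like when applied to one city's edges at a time. The function name comes from the question, but its body (a union-find), the sample rows, and the dictionary-based grouping are illustrative assumptions, not an existing Spark API. In PySpark 3.0 and later, a function along these lines could presumably be applied per group with `groupBy("city").applyInPandas(...)`, but that wiring is not shown or tested here.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the connected components of an undirected edge list.

    `edges` is an iterable of (src, dst) pairs. The result is a sorted list
    of sorted vertex lists, one list per component.
    """
    parent = {}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst  # merge the two components

    groups = defaultdict(list)
    for vertex in parent:
        groups[find(vertex)].append(vertex)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical sample rows in the (city, src, dst) layout from the question.
rows = [
    ("Houston", "Kyle", "Benny"),
    ("Houston", "Benny", "Charles"),
    ("Houston", "Charles", "Denny"),
    ("Omaha", "Carol", "Brian"),
]

# Emulate GROUP BY city, then apply the function to each group's edges.
edges_by_city = defaultdict(list)
for city, src, dst in rows:
    edges_by_city[city].append((src, dst))

components_by_city = {
    city: connected_components(edges) for city, edges in edges_by_city.items()
}
print(components_by_city)
```

Each city's edges are handled independently, so the work per group is roughly proportional to the number of edges in that city.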

Source

2017-09-25 oliver

Avoid asking the same question twice: https://stackoverflow.com/questions/46386182/how-would-i-phrase-this-python-code-in-pyspark-sql-or-sql – Mariusz

I deleted the old one; this one is better worded. – oliver

Suppose your data looks like this:

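import org.apache.spark._
import org.apache.spark.sql.functions.col
import org.graphframes._

// Edge list as (city, src, dst). Names are case-sensitive, so "charles" and
// "Charles" below are treated as two distinct vertices.
val l = List(("Houston","Kyle","Benny"),("Houston","Benny","charles"),
      ("Houston","Charles","Denny"),("Omaha","carol","Brian"),
      ("Omaha","Brian","Daniel"),("Omaha","Sara","Marry"))
val df = spark.createDataFrame(l).toDF("city","src","dst")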

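Create the list of cities to run connected components on: cities = List("Houston","Omaha")

Now, for each city in the list, filter the DataFrame on the city column, then build the edge and vertex DataFrames from the result. Create a GraphFrame from those edge and vertex DataFrames and run the connected components algorithm on it.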

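val cities = List("Houston","Omaha")

for (city <- cities) {
    // Keep only this city's edges; GraphFrame expects "src" and "dst" columns.
    val edges = df.filter(df("city") === city).drop("city")
    // Vertices are every distinct name on either end of an edge, in an "id" column.
    val vert = edges.select("src").union(edges.select("dst")).
        distinct.select(col("src").alias("id"))
    val g = GraphFrame(vert, edges)
    // Depending on the GraphFrames version, a checkpoint directory may need to be
    // set (spark.sparkContext.setCheckpointDir) before running connectedComponents.
    val res = g.connectedComponents.run()
    res.select("id", "component").orderBy("component").show()
}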

Output

+-------+------------+
|     id|   component|
+-------+------------+
|   Kyle|249108103168|
|charles|249108103168|
|  Benny|249108103168|
|Charles|721554505728|
|  Denny|721554505728|
+-------+------------+

+------+------------+
|    id|   component|
+------+------------+
| Marry|858993459200|
|  Sara|858993459200|
| Brian|944892805120|
| carol|944892805120|
|Daniel|944892805120|
+------+------------+

Source

2017-09-26 16:46:03 ashwinids

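Thanks for your work! Well, OK. I thought there might be something a bit closer to the metal, rather than looping over the values, which is what I wanted to avoid, but I still appreciate your answer. – oliver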
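One possible way to avoid looping over cities, offered here only as a sketch, is to make every vertex id unique per city, for example by pairing the city with the name. Edges then never cross city boundaries, so a single connected components pass over the whole edge list yields per-city components. The plain-Python example below illustrates that idea with a depth-first search; the function name `label_components` and the sample rows are hypothetical, and the same composite-id trick would presumably carry over to a single GraphFrames run, which is not tested here.

```python
from collections import defaultdict


def label_components(rows):
    """Map each (city, person) vertex to a component label in one pass.

    `rows` is an iterable of (city, src, dst) tuples. The label of a
    component is its smallest (city, person) vertex.
    """
    adjacency = defaultdict(set)
    for city, src, dst in rows:
        # Composite keys keep same-named people in different cities apart.
        adjacency[(city, src)].add((city, dst))
        adjacency[(city, dst)].add((city, src))

    labels = {}
    for start in sorted(adjacency):
        if start in labels:
            continue
        # Depth-first search from the first unlabelled vertex in sorted order.
        labels[start] = start
        stack = [start]
        while stack:
            vertex = stack.pop()
            for neighbour in adjacency[vertex]:
                if neighbour not in labels:
                    labels[neighbour] = start
                    stack.append(neighbour)
    return labels


# Hypothetical rows; note that "Kyle" appears in both cities.
rows = [
    ("Houston", "Kyle", "Benny"),
    ("Houston", "Benny", "Charles"),
    ("Omaha", "Carol", "Brian"),
    ("Omaha", "Kyle", "Sara"),
]

labels = label_components(rows)
for vertex in sorted(labels):
    print(vertex, "->", labels[vertex])
```

Because the city is part of each vertex id, the two people named "Kyle" end up in different components even though all rows are processed together.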
