2016-12-13 142 views

PySpark dataframe manipulation efficiency

Suppose I have the following dataframe:

+----------+-----+----+-------+
|display_id|ad_id|prob|clicked|
+----------+-----+----+-------+
|       123|  989| 0.9|      0|
|       123|  990| 0.8|      1|
|       123|  999| 0.7|      0|
|       234|  789| 0.9|      0|
|       234|  777| 0.7|      0|
|       234|  769| 0.6|      1|
|       234|  798| 0.5|      0|
+----------+-----+----+-------+

I then perform the following operations to get a final dataset (code shown below):

# Required imports
from operator import itemgetter
from pyspark.sql import functions as F
from pyspark.sql.functions import when, udf
from pyspark.sql.types import ArrayType, IntegerType

# Add a new column with the clicked ad_id if clicked == 1, 0 otherwise
df_adClicked = df.withColumn("ad_id_clicked", when(df.clicked == 1, df.ad_id).otherwise(0))

# DF -> RDD -> DF with tuple: (display_id, (ad_id, prob), clicked ad_id)
df_blah = df_adClicked.rdd.map(lambda x: (x[0], (x[1], x[2]), x[4])).toDF(["display_id", "ad_id", "clicked_ad_id"])

# Group by display_id and create column with clicked ad_id and list of tuples : (ad_id, prob) 
df_blah2 = df_blah.groupby('display_id').agg(F.collect_list('ad_id'), F.max('clicked_ad_id')) 

# Define function to sort list of tuples by prob and create list of only ad_ids 
def sortByRank(ad_id_list): 
    # Sort the (ad_id, prob) tuples by prob, descending
    sortedVersion = sorted(ad_id_list, key=itemgetter(1), reverse=True) 
    sortedIds = [i[0] for i in sortedVersion] 
    return sortedIds 

# Sort the (ad_id, prob) tuples by using udf/function and create new column ad_id_sorted 
sort_ad_id = udf(lambda x : sortByRank(x), ArrayType(IntegerType())) 
df_blah3 = df_blah2.withColumn('ad_id_sorted', sort_ad_id('collect_list(ad_id)')) 

# Function to change clickedAdId into an array of size 1 
def createClickedSet(clickedAdId): 
    setOfDocs = [clickedAdId] 
    return setOfDocs 

clicked_set = udf(lambda y : createClickedSet(y), ArrayType(IntegerType())) 
df_blah4 = df_blah3.withColumn('ad_id_set', clicked_set('max(clicked_ad_id)')) 

# Select the necessary columns 
finalDF = df_blah4.select('display_id', 'ad_id_sorted','ad_id_set') 

+----------+--------------------+---------+
|display_id|ad_id_sorted        |ad_id_set|
+----------+--------------------+---------+
|234       |[789, 777, 769, 798]|[769]    |
|123       |[989, 990, 999]     |[990]    |
+----------+--------------------+---------+
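The sorting logic inside sortByRank does not depend on Spark at all, so it can be sanity-checked in plain Python before paying for a cluster run — a minimal sketch using the question's own function and data:

```python
from operator import itemgetter

def sortByRank(ad_id_list):
    # Sort the (ad_id, prob) tuples by prob, descending, and keep only the ids
    sortedVersion = sorted(ad_id_list, key=itemgetter(1), reverse=True)
    return [i[0] for i in sortedVersion]

# The two groups from the example dataframe
print(sortByRank([(989, 0.9), (990, 0.8), (999, 0.7)]))
# → [989, 990, 999]
print(sortByRank([(789, 0.9), (777, 0.7), (769, 0.6), (798, 0.5)]))
# → [789, 777, 769, 798]
```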

Is there a more efficient way to do this? This set of transformations seems to be the bottleneck in my code. I would appreciate any feedback.

Answer


I haven't done any timing comparisons, but I would think that by not using any UDFs, Spark should be able to optimize itself.

# Scala equivalent — this line (plus putting val in front of variables) is the only
# difference between this Python response and a Scala one:
# val dfad = sc.parallelize(Seq((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF("display_id","ad_id","prob","clicked")

dfad = sc.parallelize(((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF(["display_id","ad_id","prob","clicked"]) 
dfad.registerTempTable("df_ad") 



df1 = sqlContext.sql("SELECT display_id,collect_list(ad_id) ad_id_sorted FROM (SELECT * FROM df_ad SORT BY display_id,prob DESC) x GROUP BY display_id") 
+----------+--------------------+
|display_id|        ad_id_sorted|
+----------+--------------------+
|       234|[789, 777, 769, 798]|
|       123|     [989, 990, 999]|
+----------+--------------------+

df2 = sqlContext.sql("SELECT display_id, max(ad_id) as ad_id_set from df_ad where clicked=1 group by display_id") 
+----------+---------+
|display_id|ad_id_set|
+----------+---------+
|       234|      769|
|       123|      990|
+----------+---------+


final_df = df1.join(df2,"display_id") 
+----------+--------------------+---------+
|display_id|        ad_id_sorted|ad_id_set|
+----------+--------------------+---------+
|       234|[789, 777, 769, 798]|      769|
|       123|     [989, 990, 999]|      990|
+----------+--------------------+---------+

I didn't turn ad_id_set into an array, because you are computing a max and a max can only return one value. I'm sure you could do that if you really needed to.
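Since the answer relies only on per-group descending sort and a filtered max, what the two SQL queries compute can be checked without a Spark session — a pure-Python sketch over the example rows (illustration only, not the Spark execution path):

```python
from collections import defaultdict

rows = [(123, 989, 0.9, 0), (123, 990, 0.8, 1), (123, 999, 0.7, 0),
        (234, 789, 0.9, 0), (234, 777, 0.7, 0), (234, 769, 0.6, 1),
        (234, 798, 0.5, 0)]

# df1: ad_ids per display_id, ordered by prob descending
by_display = defaultdict(list)
for display_id, ad_id, prob, clicked in rows:
    by_display[display_id].append((prob, ad_id))
ad_id_sorted = {d: [a for _, a in sorted(v, reverse=True)]
                for d, v in by_display.items()}

# df2: the clicked ad per display_id (clicked == 1)
ad_id_set = {d: a for d, a, _, c in rows if c == 1}

print(ad_id_sorted)  # → {123: [989, 990, 999], 234: [789, 777, 769, 798]}
print(ad_id_set)     # → {123: 990, 234: 769}
```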

I've included the subtle Scala differences in case someone using Scala has a similar problem in the future.


Thanks for providing this solution. I timed both solutions: yours executes in 1.38 ms, while the original executes in 2.01 ms. :) – user2253546