
Pyspark - sorting a dataframe column that contains lists

I have a dataframe like the one below -

+----------+-------+-------------------------------------------------+ 
| WindowID | State |           Details | 
+----------+-------+-------------------------------------------------+ 
|  6 | SD | [[29916,3], [156570,4], [245934,1], [329748,8]] | 
|  3 | CO |    [[524586,2], [1548,3], [527220,1]] | 
+----------+-------+-------------------------------------------------+ 

Now, I want the list in each row of the Details column sorted in descending order by the second element of the inner lists. The result should be -

+----------+-------+-------------------------------------------------+ 
| WindowID | State |           Details | 
+----------+-------+-------------------------------------------------+ 
|  6 | SD | [[329748,8], [156570,4], [29916,3], [245934,1]] | 
|  3 | CO |    [[1548,3], [524586,2], [527220,1]] | 
+----------+-------+-------------------------------------------------+ 

How can I do this in pyspark? Thanks in advance.

Answers


I found a simple trick for this problem -

import operator 

# Build the example dataframe. Note: .show() returns None, so call it
# separately instead of assigning its result to mydf.
mydf = sqlContext.createDataFrame([[6, 'SD', [[29916, 3], [156570, 4], [245934, 1], [329748, 8]]], 
          [3, 'CO', [[524586, 2], [1548, 3], [527220, 1]]]], 
          ['WindowID', 'State', 'Details']) 
mydf.show(truncate=False) 

+----------+-------+-------------------------------------------------+ 
| WindowID | State |           Details | 
+----------+-------+-------------------------------------------------+ 
|  6 | SD | [[29916,3], [156570,4], [245934,1], [329748,8]] | 
|  3 | CO |    [[524586,2], [1548,3], [527220,1]] | 
+----------+-------+-------------------------------------------------+ 

# Sort each row's Details list in plain Python by the second element,
# descending, then convert the RDD back to a dataframe.
sorted_df = mydf.rdd.map(lambda x: [x[0], x[1], sorted(x[2], \ 
          key=operator.itemgetter(1), reverse=True)]) \ 
          .toDF(['WindowID', 'State', 'Details']) 
sorted_df.show(truncate=False) 

+----------+-------+-------------------------------------------------+ 
| WindowID | State |           Details | 
+----------+-------+-------------------------------------------------+ 
|  6 | SD | [[329748,8], [156570,4], [29916,3], [245934,1]] | 
|  3 | CO |    [[1548,3], [524586,2], [527220,1]] | 
+----------+-------+-------------------------------------------------+ 
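If you are on Spark 3.0 or later, the same sort can also be done without leaving the DataFrame API. The sketch below is my own addition (it assumes the mydf dataframe built above and Spark 3.0+); it uses the array_sort SQL function with a custom comparator, where returning -1 when the left element's second value is larger produces a descending order.

from pyspark.sql import functions as F 

# Sort each Details array by its second element, descending, using the
# comparator form of array_sort (Spark 3.0+, no Python UDF or RDD needed).
sorted_df = mydf.withColumn( 
    "Details", 
    F.expr(""" 
        array_sort(Details, (x, y) -> 
            CASE WHEN x[1] > y[1] THEN -1 
                 WHEN x[1] < y[1] THEN 1 
                 ELSE 0 END) 
    """)) 
sorted_df.show(truncate=False) 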

I don't know what you have tried, but check the solution below; it should work for you.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType 
from pyspark.sql.functions import udf 

# Explicit schema: Details is an array of integer arrays.
dfSchema = StructType([StructField('WindowID', IntegerType(), True), 
                       StructField('State', StringType(), True), 
                       StructField('Details', ArrayType(ArrayType(IntegerType())), True)]) 

mydf = sqlContext.createDataFrame([[6, 'SD', [[29916, 3], [156570, 4], [245934, 1], [329748, 8]]], 
                                   [3, 'CO', [[524586, 2], [1548, 3], [527220, 1]]]], dfSchema) 
mydf.show(truncate=False) 

+--------+-----+---------------------------------------------------------------------------------------------------+ 
|WindowID|State|Details                       | 
+--------+-----+---------------------------------------------------------------------------------------------------+ 
|6  |SD |[WrappedArray(29916, 3), WrappedArray(156570, 4), WrappedArray(245934, 1), WrappedArray(329748, 8)]| 
|3  |CO |[WrappedArray(524586, 2), WrappedArray(1548, 3), WrappedArray(527220, 1)]       | 
+--------+-----+---------------------------------------------------------------------------------------------------+ 

# UDF that sorts each Details array by its second element, descending.
def def_sort(x): 
     return sorted(x, key=lambda x: x[1], reverse=True) 

udf_sort = udf(def_sort, ArrayType(ArrayType(IntegerType()))) 
mydf.select("windowID", "State", udf_sort("Details")).show(truncate=False) 


+--------+-----+---------------------------------------------------------------------------------------------------+ 
|windowID|State|PythonUDF#def_sort(Details)                  | 
+--------+-----+---------------------------------------------------------------------------------------------------+ 
|6  |SD |[WrappedArray(329748, 8), WrappedArray(156570, 4), WrappedArray(29916, 3), WrappedArray(245934, 1)]| 
|3  |CO |[WrappedArray(1548, 3), WrappedArray(524586, 2), WrappedArray(527220, 1)]       | 
+--------+-----+---------------------------------------------------------------------------------------------------+ 
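A small follow-up suggestion (my own, not part of the answer above): the selected UDF column gets an auto-generated name such as PythonUDF#def_sort(Details), so if you want to keep the original column name you can replace the column in place instead:

# Overwrite the Details column with its sorted version, keeping the name.
sorted_mydf = mydf.withColumn("Details", udf_sort("Details")) 
sorted_mydf.show(truncate=False) 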