
Answers


You can do this in pyspark with collect_set:

from pyspark.sql import functions as F
df.groupBy('location', 'date').agg(F.collect_set('student_id')).show()

+--------+----------+-----------------------+
|location|      date|collect_set(student_id)|
+--------+----------+-----------------------+
|   18250|2015-01-04|               [347416]|
|   18253|2015-01-02|       [167633, 188734]|
|   18250|2015-01-03|               [363796]|
+--------+----------+-----------------------+
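
For context, here is a minimal self-contained sketch of the same approach; the column names match the answer, but the SparkSession, sample rows, and variable names are illustrative assumptions:

from pyspark.sql import SparkSession, functions as F

# Hypothetical session and sample data matching the columns used above
spark = SparkSession.builder.appName("collect_set_example").getOrCreate()
rows = [
    (18250, '2015-01-04', 347416),
    (18253, '2015-01-02', 167633),
    (18253, '2015-01-02', 188734),
    (18250, '2015-01-03', 363796),
]
df = spark.createDataFrame(rows, ['location', 'date', 'student_id'])

# collect_set gathers the distinct student_ids within each (location, date) group
df.groupBy('location', 'date').agg(F.collect_set('student_id')).show()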

Assuming your data consists of rows of the form:

(location, date, student_id) 

you can do something like this:

(data
 # key = (location, date), value = a one-element set holding the student_id
 .map(lambda row: (tuple(row[0:2]), {row[2]}))
 # merge the sets of rows that share the same (location, date) key
 .reduceByKey(lambda a, b: a.union(b))
 .collect())
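
A minimal runnable sketch of the same RDD pipeline; the SparkContext `sc` and the sample tuples are assumptions for illustration:

from pyspark import SparkContext

# Hypothetical context and sample (location, date, student_id) tuples
sc = SparkContext.getOrCreate()
data = sc.parallelize([
    (18253, '2015-01-02', 167633),
    (18253, '2015-01-02', 188734),
    (18250, '2015-01-03', 363796),
])

result = (data
          .map(lambda row: (tuple(row[0:2]), {row[2]}))
          .reduceByKey(lambda a, b: a.union(b))
          .collect())
# e.g. [((18253, '2015-01-02'), {167633, 188734}), ((18250, '2015-01-03'), {363796})]
print(result)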