
Converting a Spark Dataset to a custom object in Scala

I want to call an API that expects an EmployeeElements object, shown below:

public class EmployeeElements { 
    private Set<Long> eIds; 
    private Map<Long, List<EmployeeDetails>> employeeDetails; 
    private Map<Long, List<Address>> address; 
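    // getters and setters (setEIds, setEmployeeDetails, setAddress) are assumed but not shown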
} 

From the Dataset I want to create objects in the format Map[Long, List[CustomObject]].

As input we will have the following DataFrames.

EmployeeDF 

+---------+-------------------+
|emp_Id   |hired_Date         |
+---------+-------------------+
|1387779  |2016-09-27 19:47:28|
|1387781  |2016-09-27 19:47:28|
|1387780  |2016-09-27 19:47:28|
|1387780  |2016-09-27 19:47:28|
|1387777  |2016-09-27 19:47:28|
|1387778  |2016-09-27 19:47:28|
+---------+-------------------+

EmployeeDetailsDF: 

+---------+---------+--------+----------+-------------------+
|emp_Id   |firstname|lastname|gendercode|dateofbirth        |
+---------+---------+--------+----------+-------------------+
|1387777  |Jon      |Snow    |F         |1985-01-01 00:00:00|
|1387778  |Jon      |Snow    |M         |1985-01-01 00:00:00|
|1387779  |Jon      |Snow    |F         |1985-01-01 00:00:00|
|1387780  |Jon      |Snow    |F         |1985-01-01 00:00:00|
|1387781  |Jon      |Snow    |F         |1985-01-01 00:00:00|
+---------+---------+--------+----------+-------------------+

AddressDf: 

+---------+------------------+-----------------+-------------------+
|emp_Id   |patient_address_id|Country          |joined_Date        |
+---------+------------------+-----------------+-------------------+
|1387779  |2435146           |USA              |2016-09-27 19:47:28|
|1387781  |2435148           |AUS              |2016-09-27 19:47:28|
|1387780  |2435147           |USA              |2016-09-27 19:47:28|
|1387780  |2435149           |UK               |2016-09-27 19:47:28|
|1387777  |2435144           |AUS              |2016-09-27 19:47:28|
|1387778  |2435145           |USA              |2016-09-27 19:47:28|
+---------+------------------+-----------------+-------------------+
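For the code sketches that follow, assume row types roughly like these case classes, inferred from the table headers above; the real definitions in the project may differ (the EmployeeDetails and Address referenced by the EmployeeElements bean may well be separate Java classes):

import java.sql.Timestamp

// Assumed row types mirroring the DataFrames shown above (not part of the original post)
case class Employee(emp_Id: Long, hired_Date: Timestamp)
case class EmployeeDetails(emp_Id: Long, firstname: String, lastname: String,
                           gendercode: String, dateofbirth: Timestamp)
case class Address(emp_Id: Long, patient_address_id: Long, Country: String,
                   joined_Date: Timestamp)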

EmployeeDetailsDF will be joined with EmployeeDF. Similarly, AddressDf will also be joined with EmployeeDF.
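A minimal sketch of those joins, assuming the case classes above, inner joins on emp_Id, and that the SQLContext/SparkSession implicits are imported (the actual join code and variable names are not shown in the post):

import org.apache.spark.sql.Dataset
// import sqlContext.implicits._   // Spark 1.6; use spark.implicits._ on 2.x

// Join addresses to employees on emp_Id, then keep the Address columns so the
// result can be typed back to Dataset[Address]; the details join works the same way.
val empAddressDs: Dataset[Address] = addressDf
  .join(employeeDF, Seq("emp_Id"))
  .select("emp_Id", "patient_address_id", "Country", "joined_Date")
  .as[Address]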

Now, out of that joined DataFrame, I want to create `Map[Long,List[CustomObject]]`. For example, a couple of rows from the joined DataFrame of AddressDf and EmployeeDF:

(1387777,List([1387777,2435144,AUS])) 
(1387780,List([1387780,2435147,USA], [1387780,2435149,UK])) 

When I try to cast groupAddressDs to a Map:

// empAddressDs is the joined Dataset of AddressDf and EmployeeDF
val groupAddressDs: GroupedDataset[Long, Address] = empAddressDs.groupBy { x => x.emp_Id }
val addressMap = groupAddressDs.asInstanceOf[Map[String, List[Procedure]]]

I get this exception:

Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.GroupedDataset cannot be cast to scala.collection.immutable.Map

I haven't found a solution for this. I tried GroupedDataset, but in the end I couldn't cast it to a Map object and it throws the ClassCastException above. Then I tried a pair RDD, but I couldn't convert the RDD rows back into my custom objects.
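One sketch of the RDD route for that shape: grouping the typed Dataset's underlying RDD keeps the elements as the custom type, so there is no Row-to-object conversion step. This assumes empAddressDs is a Dataset[Address] and that the collected map fits in driver memory:

// Group the joined addresses by employee id and build the map on the driver (sketch).
val addressByEmp: Map[Long, List[Address]] = empAddressDs.rdd
  .groupBy(_.emp_Id)      // RDD[(Long, Iterable[Address])]
  .mapValues(_.toList)    // RDD[(Long, List[Address])]
  .collect()
  .toMap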

I have to handle millions of records and hundreds of custom objects inside the EmployeeElements object.


Are you trying to marshal millions of records into a single EmployeeElements object? What do you ultimately need to do with EmployeeElements? – josephpconley


@josephpconley: I want to pass the EmployeeElements object to an external API. – Kalpesh

Answer

Instead of calling asInstanceOf on the grouped data set, execute the mapGroups function on the GroupedDataset. For example:

import scala.collection.JavaConverters._

val employeeElementsDs = groupAddressDs.mapGroups { case (k, xs) =>
  val employeeElements = new EmployeeElements()
  // convert the Scala collections to the java.util types the EmployeeElements bean expects
  employeeElements.setEIds(Set(Long.box(k)).asJava)
  employeeElements.setAddress(Map(Long.box(k) -> xs.toList.asJava).asJava)
  (k, employeeElements)
}
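
One caveat not spelled out in the answer (an assumption on my part): mapGroups needs an implicit Encoder for its output type, and Spark has no built-in encoder for a Java bean like EmployeeElements, so something like a Kryo-based encoder has to be in scope before the mapGroups call above. A sketch of that, plus handing the results to the external API:

import org.apache.spark.sql.Encoders

// Must be in scope before the mapGroups call above (sketch).
implicit val outEncoder = Encoders.tuple(Encoders.scalaLong, Encoders.kryo[EmployeeElements])

// Collect the per-employee results and pass each one to the external API.
// externalApi.process is a hypothetical stand-in for the real call.
employeeElementsDs.collect().foreach { case (_, elements) =>
  // externalApi.process(elements)
}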