2016-09-27 62 views

Best way to map case classes in Spark and Scala

I am working with Spark and Scala and have a Dataset defined by a case class that looks like this:

case class Shareholders(
    business_id : String, 
    guo_name : String, 
    guo_id : String, 
    duo_name : String, 
    duo_id : String 
) 

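There are many more fields starting with "guo"/"duo". Apart from this prefix, the field names are identical and repeat.

I would like to reshape this into a case class structure that looks like: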

case class NewShareholders(
    business_id : String, 
    repeatedFields : Seq[RepeatedShareholderFields] 
) 

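case class RepeatedShareholderFields (
    name : String,
    id : String,
    `type` : String   // `type` is a Scala keyword, so it needs backticks as a field name
)

where type = "guo" / "duo" etc. as appropriate.

What is the best way to do this?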


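I strongly recommend designing Spark `Dataset`s as if they were SQL relational tables. That is what Spark is currently optimized for. From your example so far, I can't propose a good solution for you; the only thing I can say is to consult a DBA on how best to normalize your "shareholders" table. – Yawar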
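
One possible reading of the comment above is a "long" relational layout, with one row per business and shareholder kind instead of one wide row per business. The sketch below is only an illustration under that assumption: `ShareholderRow`, its `kind` column, and the helper `toRows` are hypothetical names that do not appear in the question.

```scala
object NormalizedShareholders {
  // Wide input shape, copied from the question.
  case class Shareholders(
    business_id: String,
    guo_name: String,
    guo_id: String,
    duo_name: String,
    duo_id: String
  )

  // Hypothetical normalized row: the former prefix becomes a plain column.
  case class ShareholderRow(
    business_id: String,
    kind: String,
    name: String,
    id: String
  )

  // Hypothetical helper: emit one row per prefixed name/id pair.
  def toRows(s: Shareholders): Seq[ShareholderRow] = Seq(
    ShareholderRow(s.business_id, "guo", s.guo_name, s.guo_id),
    ShareholderRow(s.business_id, "duo", s.duo_name, s.duo_id)
  )
}
```

Assuming a `Dataset[Shareholders]` named `ds` and `import spark.implicits._` in scope, something like `ds.flatMap(NormalizedShareholders.toRows)` should produce the long table, which can then be filtered or grouped by `kind` like any other column.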

Answer

import java.util.UUID
import scala.language.existentials

case class NewShareholder(
    businessId : String,
    // The existential wraps the whole field type, so each element may have its own T.
    fields : Seq[ShareholderField[T] forSome {type T}]
)

case class ShareholderField[T] (
    prefix : String,
    nameValue : T,
    idValue : T
)

// Now you can create your shareholders as follows:

val sh1 = NewShareholder(
    businessId = "abcde-1234",
    fields = Seq(
        ShareholderField[String]("guo", "guo-name-1", "guo-id-1"),
        ShareholderField[String]("duo", "duo-name-1", "duo-id-1"),
        ShareholderField[UUID]("luo", UUID.randomUUID(), UUID.randomUUID())
    )
)

If you know that all the values are actually Strings, you can simplify it.

case class NewShareholder(
    businessId : String,
    fields : Seq[ShareholderField]
)

case class ShareholderField (
    prefix : String,
    nameValue : String,
    idValue : String
)

// Now you can create your shareholders as follows:

val sh1 = NewShareholder(
    businessId = "abcde-1234",
    fields = Seq(
        ShareholderField("guo", "guo-name-1", "guo-id-1"),
        ShareholderField("duo", "duo-name-1", "duo-id-1"),
        ShareholderField("luo", "luo-name-1", "luo-id-1")
    )
)
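
The answer shows the target structure but not the conversion from the flat `Shareholders` class in the question. A minimal sketch of that step, assuming the all-String variant above, could look like this. The object name `ShareholderMapping` and the helper `toNewShareholder` are made up for illustration, and only the two prefixes shown in the question are handled.

```scala
object ShareholderMapping {
  // Flat input shape, copied from the question.
  case class Shareholders(
    business_id: String,
    guo_name: String,
    guo_id: String,
    duo_name: String,
    duo_id: String
  )

  // Simplified target shape from the answer.
  case class ShareholderField(prefix: String, nameValue: String, idValue: String)
  case class NewShareholder(businessId: String, fields: Seq[ShareholderField])

  // Hypothetical helper: build one ShareholderField per prefixed name/id pair.
  def toNewShareholder(s: Shareholders): NewShareholder =
    NewShareholder(
      businessId = s.business_id,
      fields = Seq(
        ShareholderField("guo", s.guo_name, s.guo_id),
        ShareholderField("duo", s.duo_name, s.duo_id)
      )
    )
}
```

Assuming a `Dataset[Shareholders]` named `ds` and `import spark.implicits._` in scope, `ds.map(ShareholderMapping.toNewShareholder)` should give a `Dataset[NewShareholder]`. Each additional prefix needs its own line in the `Seq`, so with many prefixes the explicit list gets long, but it stays type-checked against the input case class.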