Scala: merge two or more strings into an array as one JSON attribute

I have many files of JSON string lines that look like this:

{ "id":123, "team":"A", "etc":"...", ...} 
{ "id":124, "team":"A", "etc":"...", ...} 
{ "id":124, "team":"B", "etc":"...", ...} 
{ "id":125, "team":"A", "etc":"...", ...}

I can load them into a DataFrame in Scala.

Grouping by id, I want something like this:

{ "id":123, "team":"A", "etc":"...", ...} 
{ "id":124, "team":["A","B"], "etc":"...", ...} 
{ "id":125, "team":"A", "etc":"...", ...}

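How can I do this in Scala?

Note: I do not know how many sub-attributes each JSON line has. Most attributes are common to all the JSON lines, but a few lines may carry some unique attributes.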
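For reference, the loading step mentioned above might look roughly like the sketch below. It assumes Spark 2.2 or later (for the `json(Dataset[String])` overload) and a local session; the object name `LoadJsonLinesSketch`, the sample rows, and the `extra` attribute are made up for illustration. Spark's JSON reader infers a single schema across all lines, so an attribute that appears in only a few lines simply becomes a nullable column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadJsonLinesSketch {
  // Parse in-memory JSON lines into a DataFrame.
  // For real files, spark.read.json("<path>") reads one JSON object per line.
  def load(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    spark.read.json(lines.toDS())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("load-json-lines-sketch")
      .master("local[*]") // assumption: local run for experimentation
      .getOrCreate()

    val df = load(spark, Seq(
      """{"id":123,"team":"A","etc":"X"}""",
      """{"id":124,"team":"A","etc":"Y"}""",
      // Hypothetical line with an attribute the other lines lack.
      """{"id":124,"team":"B","etc":"Z","extra":"only-here"}""",
      """{"id":125,"team":"A","etc":"X"}"""
    ))

    df.printSchema() // "extra" shows up as a nullable string column
    df.show()
    spark.stop()
  }
}
```

Rows that lack `extra` hold null in that column, and `collect_list` skips nulls when aggregating, so sparse attributes should not need special handling later on.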

2017-02-24 Daebarkee

Do you want to do this with Apache Spark? –

Yes! Apache Spark. – Daebarkee

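If I understand correctly, you want to group by id and collect each of the other columns into a list?

Updated to use a dynamic list of columns: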

df: org.apache.spark.sql.DataFrame = [etc: string, id: bigint ... 1 more field] 

scala> df.show
+---+---+----+
|etc| id|team|
+---+---+----+
|  X|123|   A|
|  Y|124|   A|
|  Z|124|   B|
|  X|125|   A|
+---+---+----+

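import org.apache.spark.sql.functions.collect_list

// Column to group by; every other column is collected into a list.
val grpCol = "id"
val collectCols = df.columns.filter(_ != grpCol).map(c => collect_list(c).as(c))

// agg takes one Column plus varargs, hence the head/tail split.
df.groupBy(grpCol).agg(collectCols.head, collectCols.tail: _*).show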

+---+------+------+
| id|   etc|  team|
+---+------+------+
|124|[Y, Z]|[A, B]|
|123|   [X]|   [A]|
|125|   [X]|   [A]|
+---+------+------+
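The same approach can be sketched as a standalone program rather than a spark-shell session. This is only an illustration under a few assumptions: a local Spark session, and the hypothetical names `GroupByIdSketch` and `collectOtherColumns`. It swaps `collect_set` in for `collect_list` so that repeated values are not duplicated (a deliberate change from the answer above), and prints the result back as JSON lines with `toJSON`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_set}

object GroupByIdSketch {
  // Group by grpCol and gather the distinct values of every other column into an array.
  def collectOtherColumns(df: DataFrame, grpCol: String): DataFrame = {
    val others = df.columns.filter(_ != grpCol).map(c => collect_set(col(c)).as(c))
    require(others.nonEmpty, "need at least one column besides the grouping column")
    // agg takes one Column plus varargs, hence the head/tail split.
    df.groupBy(col(grpCol)).agg(others.head, others.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("group-by-id-sketch")
      .master("local[*]") // assumption: local run for experimentation
      .getOrCreate()
    import spark.implicits._

    // Same sample rows as in the df.show output above.
    val df = Seq(
      ("X", 123L, "A"),
      ("Y", 124L, "A"),
      ("Z", 124L, "B"),
      ("X", 125L, "A")
    ).toDF("etc", "id", "team")

    // One JSON object per grouped id; row and element order are not guaranteed.
    collectOtherColumns(df, "id").toJSON.collect().foreach(println)
    spark.stop()
  }
}
```

Every non-grouping column comes out as an array, even when it holds a single value (for example `"team":["A"]`), which differs slightly from the output requested in the question, where single values stay scalar.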

2017-02-24 03:35:54 Traian

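Thanks. One more question, though: I do not know how many other columns there will be. Is there a smart way to call collect_list() on an indefinite number of columns? – Daebarkee

Updated to use a dynamic list of columns. – Traian

Thanks @PatRox. This works perfectly! – Daebarkee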
