Spark Dataset agg method

I'm using Spark and the Dataset API to build some analytical datasets. I've reached the part where I'm calculating some variables, and it looks like this:

CntDstCdrs1.groupByKey(x => (x.bs_recordid, x.bs_utcdate)).agg(
    count(when(($"bc_sub_org_id" === lit(500) && $"bc_utcdate" >= $"day_1" && $"bc_utcdate" <= $"bs_utcdate") , $"bc_phonenum")).as[Long].name("count_phone_1day"), 
    count(when(($"bc_sub_org_id" === lit(500) && $"bc_utcdate" >= $"day_3" && $"bc_utcdate" <= $"bs_utcdate") , $"bc_phonenum")).as[Long].name("count_phone_3day_cust"), 
    count(when(($"bc_sub_org_id" === lit(500) && $"bc_utcdate" >= $"day_5" && $"bc_utcdate" <= $"bs_utcdate") , $"bc_phonenum")).as[Long].name("count_phone_5day_cust"), 
    count(when(($"bc_sub_org_id" === lit(500) && $"bc_utcdate" >= $"day_7" && $"bc_utcdate" <= $"bs_utcdate") , $"bc_phonenum")).as[Long].name("count_phone_7day_cust") 
).show()

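This code works fine, but when I try to add one more count, "count_phone_30day", I get an error: "method overloaded...". Does this mean the signature of the agg method on a Dataset takes at most 4 Column expressions? In any case, if this approach isn't the best practice for calculating a large number of variables, which one would be? I have counts, distinct counts, sums and similar variables.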

KR, Stefan


2017-09-23 StStojanovic

The 'method overloaded' error may be caused by something else, since 'agg' on a 'Dataset' can take more than 4 aggregate functions with 'when' conditions.

@LeoC It can, but only for the relational 'groupBy'; the key-value 'groupByKey' has a different implementation.

Dataset.groupByKey returns a KeyValueGroupedDataset.

This class has no varargs agg; you can pass at most 4 columns as arguments.
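One possible way around the limit, following the comment about the relational 'groupBy', is to group by columns instead of by key, because agg on the resulting RelationalGroupedDataset accepts any number of Column expressions. The sketch below is only an illustration under assumptions: the Cdr case class, its field types (dates stored as yyyyMMdd integers), the day_30 column, the sample rows and the phoneCount helper are all hypothetical and not taken from the question.

```scala
import org.apache.spark.sql.{Column, DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit, when}

object WindowedCounts {
  // Hypothetical record type: field names follow the question, types are assumed.
  case class Cdr(bs_recordid: Long, bs_utcdate: Int, bc_sub_org_id: Int,
                 bc_utcdate: Int, bc_phonenum: String,
                 day_1: Int, day_3: Int, day_5: Int, day_7: Int, day_30: Int)

  // Hypothetical helper: counts bc_phonenum for sub-org 500 rows whose
  // bc_utcdate lies between the given start column and bs_utcdate.
  def phoneCount(startCol: String, name: String): Column =
    count(when(col("bc_sub_org_id") === lit(500) &&
      col("bc_utcdate") >= col(startCol) &&
      col("bc_utcdate") <= col("bs_utcdate"), col("bc_phonenum"))).as(name)

  // Relational groupBy: agg takes a first Column plus varargs, so five
  // (or more) aggregates are accepted. The result is an untyped DataFrame.
  def windowedCounts(cdrs: Dataset[Cdr]): DataFrame =
    cdrs.groupBy(col("bs_recordid"), col("bs_utcdate")).agg(
      phoneCount("day_1", "count_phone_1day"),
      phoneCount("day_3", "count_phone_3day_cust"),
      phoneCount("day_5", "count_phone_5day_cust"),
      phoneCount("day_7", "count_phone_7day_cust"),
      phoneCount("day_30", "count_phone_30day"))

  // Made-up sample rows, all belonging to the same group.
  def sample(spark: SparkSession): Dataset[Cdr] = {
    import spark.implicits._
    Seq(
      Cdr(1L, 20170930, 500, 20170930, "111", 20170929, 20170927, 20170925, 20170923, 20170831),
      Cdr(1L, 20170930, 500, 20170905, "222", 20170929, 20170927, 20170925, 20170923, 20170831),
      Cdr(1L, 20170930, 400, 20170930, "333", 20170929, 20170927, 20170925, 20170923, 20170831)
    ).toDS()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("agg-sketch").getOrCreate()
    windowedCounts(sample(spark)).show()
    spark.stop()
  }
}
```

The trade-off is that the grouping is no longer typed: the result is a DataFrame rather than a Dataset of key/value tuples, so it would need an explicit .as[...] with a matching case class if a typed Dataset is required afterwards.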


2017-09-24 17:31:02
