You can use a subquery to produce a result set of (customer_id, item_id, item_rank) sorted by item_rank, and then apply collect_set in the outer query.

Query
    WITH table1 AS (
        SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
        SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
        SELECT 23 AS customer_id, 4 AS item_id, 2 AS item_rank UNION ALL
        SELECT 25 AS customer_id, 5 AS item_id, 1 AS item_rank UNION ALL
        SELECT 25 AS customer_id, 4 AS item_id, 2 AS item_rank
    )
    SELECT
        subquery.customer_id,
        collect_set(subquery.item_id) AS item_id_set
    FROM (
        SELECT
            table1.customer_id,
            table1.item_id,
            table1.item_rank
        FROM table1
        DISTRIBUTE BY
            table1.customer_id
        SORT BY
            table1.customer_id,
            table1.item_rank
    ) subquery
    GROUP BY
        subquery.customer_id
    ;

Result

       customer_id  item_id_set
    0  23           [4,2]
    1  25           [5,4]

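The subquery uses DISTRIBUTE BY to guarantee that all rows for a particular customer_id are routed to the same reducer. It then uses SORT BY to sort by customer_id and item_rank within each reducer. I expect this is sufficient for the requirement, since I did not notice a requirement for total ordering of the final result set. (If total ordering by customer_id is required, then I suppose the query would have to use ORDER BY, which would result in slower execution.)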
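
To make the DISTRIBUTE BY / SORT BY behaviour concrete, here is a rough Java simulation of the idea. It is only a sketch under assumptions, not how Hive is implemented: the class, the `reducerFor` hash-modulo routing, and the choice of 3 reducers are all hypothetical. Rows are `{customer_id, item_id, item_rank}` arrays taken from the sample data in the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class DistributeSortSketch {
    // Sample rows from the query: {customer_id, item_id, item_rank}.
    static List<int[]> sampleRows() {
        return Arrays.asList(
            new int[] {23, 2, 3},
            new int[] {23, 2, 3},
            new int[] {23, 4, 2},
            new int[] {25, 5, 1},
            new int[] {25, 4, 2});
    }

    // Hypothetical stand-in for DISTRIBUTE BY customer_id:
    // rows with the same key always map to the same reducer index.
    static int reducerFor(int customerId, int numReducers) {
        return Math.floorMod(Integer.hashCode(customerId), numReducers);
    }

    // Route each row to a reducer, then sort every reducer's rows by
    // (customer_id, item_rank). Like SORT BY, the ordering holds only
    // inside one reducer; nothing orders the reducers against each other.
    static List<List<int[]>> distributeAndSort(List<int[]> rows, int numReducers) {
        List<List<int[]>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new ArrayList<>());
        }
        for (int[] row : rows) {
            reducers.get(reducerFor(row[0], numReducers)).add(row);
        }
        Comparator<int[]> byCustomerThenRank =
            Comparator.<int[]>comparingInt(r -> r[0]).thenComparingInt(r -> r[2]);
        for (List<int[]> reducerRows : reducers) {
            reducerRows.sort(byCustomerThenRank);
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<List<int[]>> reducers = distributeAndSort(sampleRows(), 3);
        for (int i = 0; i < reducers.size(); i++) {
            for (int[] row : reducers.get(i)) {
                System.out.println("reducer " + i + ": " + Arrays.toString(row));
            }
        }
    }
}
```

In this sketch each customer's rows end up together and in item_rank order, which is all the outer GROUP BY needs. The order in which the customers themselves come out depends on which reducer they landed in, which is the part only ORDER BY would fix.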
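
Internally, the collect_set UDAF uses a Java LinkedHashSet, which is a set that preserves insertion order, so the sort order used in the subquery is retained in the set built by the outer query. This can be seen in the Hive codebase here:
https://github.com/apache/hive/blob/release-2.0.0/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java#L93
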
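
As a small illustration of the LinkedHashSet behaviour this relies on, the sketch below feeds item_id values into a `java.util.LinkedHashSet` in the order the sorted subquery would deliver them. The class and the `collectInArrivalOrder` helper are hypothetical names for this example and are not Hive's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CollectSetOrderSketch {
    // Hypothetical helper: collects values into an insertion-ordered set,
    // then returns them as a list so the order is easy to inspect.
    static List<Integer> collectInArrivalOrder(List<Integer> itemIds) {
        Set<Integer> seen = new LinkedHashSet<>();
        for (Integer itemId : itemIds) {
            // Adding a value that is already present is a no-op and
            // does not move it from its first-seen position.
            seen.add(itemId);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // customer 23: item_ids arriving in item_rank order 2, 3, 3
        System.out.println(collectInArrivalOrder(Arrays.asList(4, 2, 2))); // prints [4, 2]
        // customer 25: item_ids arriving in item_rank order 1, 2
        System.out.println(collectInArrivalOrder(Arrays.asList(5, 4)));    // prints [5, 4]
    }
}
```

The duplicate row for customer 23 is dropped while the first-seen order is kept, which matches the [4,2] and [5,4] sets in the result above.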
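Thanks Chris, but won't there be a problem with a larger data set? From what I can see, SORT BY or even CLUSTER BY does not ensure global order: https://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by – vkaul11
@vkaul11, great observation! I have updated the answer to use 'DISTRIBUTE BY' and 'SORT BY'. (I did not notice a requirement for total ordering, so I did not use 'ORDER BY'.) –
No Chris, total ordering is not required, so it is indeed sufficient. – vkaul11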