Hive collect_set()

Suppose I have two tables: timeperiod1 and timeperiod2.

timeperiod1 has a schema like this:

cluster characteristic 
A  1 
A  2 
A  3 
B  2 
B  3

timeperiod2 has a schema like this:

cluster characteristic 
A  1 
A  2 
B  2 
B  3 
B  4

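I want to compute, per cluster, the set difference between the two time periods (i.e., the two tables). My plan (please let me know if there is a better way) is to 1) run collect_set (I know how to do this) and then 2) compare the sets with a set difference (I don't know how to do this).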

1) What I did:

CREATE TABLE collect_char_wk1 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic) AS characteristic
FROM timeperiod1
GROUP BY cluster;

CREATE TABLE collect_char_wk2 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic) AS characteristic
FROM timeperiod2
GROUP BY cluster;

This gives collect_char_wk1:

cluster characteristic 
A  [1,2,3] 
B  [2,3]

and collect_char_wk2:

cluster characteristic 
A  [1,2] 
B  [2,3,4]

2) Is there a Hive function I can use to compute the set difference? I'm not familiar enough with Java to write my own set_diff() Hive UDF/UDAF.

The result should be a table, set_diff_wk1_to_wk2:

cluster set_diff 
A  1 
B  0
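
To pin down the semantics of the expected result, here is a minimal Python sketch on the toy data only. It mimics COLLECT_SET with a dictionary of sets and then counts the elements of wk1 that are missing from wk2 for each cluster. The helper name collect_sets is made up for this illustration, and nothing here is meant to scale to the real data.

```python
from collections import defaultdict

# Toy rows from the two tables: (cluster, characteristic).
timeperiod1 = [("A", 1), ("A", 2), ("A", 3), ("B", 2), ("B", 3)]
timeperiod2 = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("B", 4)]


def collect_sets(rows):
    """Group characteristics by cluster into sets (hypothetical helper,
    analogous to COLLECT_SET(characteristic) ... GROUP BY cluster)."""
    grouped = defaultdict(set)
    for cluster, characteristic in rows:
        grouped[cluster].add(characteristic)
    return dict(grouped)


wk1 = collect_sets(timeperiod1)
wk2 = collect_sets(timeperiod2)

# Size of the set difference wk1 minus wk2, per cluster.
set_diff_wk1_to_wk2 = {
    cluster: len(chars - wk2.get(cluster, set()))
    for cluster, chars in sorted(wk1.items())
}
print(set_diff_wk1_to_wk2)
```

For cluster A only characteristic 3 is missing from the second period, and for cluster B nothing is missing, which matches the table above.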

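The above is a toy example. My actual data is on the scale of tens of billions of rows with multiple columns, so a computationally efficient solution is needed. My data is stored in HDFS, and I am using HiveQL + Python.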

2017-03-27 user2205916

If you are trying to get, for each cluster, the number of characteristics in period1 that do not appear in period2, you can simply use a left join and group by.

select t1.cluster, count(case when t2.characteristic is null then 1 end) as set_diff 
from timeperiod1 t1 
left join timeperiod2 t2 on t1.cluster=t2.cluster and t1.characteristic=t2.characteristic 
group by t1.cluster
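
As a quick sanity check of the query above, the sketch below runs the same left join on the toy data in an in-memory SQLite database from Python. SQLite stands in for Hive here only because this particular query is plain SQL; it says nothing about performance on the real tables, and it assumes each (cluster, characteristic) pair appears at most once in timeperiod1, as in the toy data.

```python
import sqlite3

# Load the toy tables from the question into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE timeperiod1 (cluster TEXT, characteristic INTEGER);
    CREATE TABLE timeperiod2 (cluster TEXT, characteristic INTEGER);
    INSERT INTO timeperiod1 VALUES ('A',1),('A',2),('A',3),('B',2),('B',3);
    INSERT INTO timeperiod2 VALUES ('A',1),('A',2),('B',2),('B',3),('B',4);
""")

# Rows of timeperiod1 with no match in timeperiod2 come back with NULLs
# on the t2 side; counting those gives the size of the set difference.
query = """
    SELECT t1.cluster,
           COUNT(CASE WHEN t2.characteristic IS NULL THEN 1 END) AS set_diff
    FROM timeperiod1 t1
    LEFT JOIN timeperiod2 t2
           ON t1.cluster = t2.cluster
          AND t1.characteristic = t2.characteristic
    GROUP BY t1.cluster
    ORDER BY t1.cluster
"""
left_join_result = conn.execute(query).fetchall()
print(left_join_result)
```

On the toy data this returns 1 for cluster A and 0 for cluster B, the same numbers as the expected set_diff_wk1_to_wk2 table in the question.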

2017-03-27 19:50:12

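Out of curiosity, is that faster than using collect_set()? It seems like the LEFT JOIN would take a long time without reducing the number of rows, whereas the collect_set() approach reduces the number of rows significantly. I added a note above explaining that I am working with billions of rows of data (about 30 billion), so minimizing query time is ideal. – user2205916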

@user2205916 Try it on your data and check the run time. It is hard to say which approach will be faster.

select  cluster
       ,count(*)                                          as count_total_characteristic
       ,count(case when in_1 = 1 and in_2 = 1 then 1 end) as count_both_1_and_2
       ,count(case when in_1 = 1 and in_2 = 0 then 1 end) as count_only_in_1
       ,count(case when in_1 = 0 and in_2 = 1 then 1 end) as count_only_in_2

       ,sort_array(collect_list(case when in_1 = 1 and in_2 = 1 then characteristic end)) as both_1_and_2
       ,sort_array(collect_list(case when in_1 = 1 and in_2 = 0 then characteristic end)) as only_in_1
       ,sort_array(collect_list(case when in_1 = 0 and in_2 = 1 then characteristic end)) as only_in_2

from   (select  cluster
               ,characteristic
               ,max(case when tab = 1 then 1 else 0 end) as in_1
               ,max(case when tab = 2 then 1 else 0 end) as in_2

        from   (          select 1 as tab,cluster,characteristic from timeperiod1
                union all select 2 as tab,cluster,characteristic from timeperiod2
               ) t

        group by cluster
                ,characteristic
       ) t

group by cluster

order by cluster
;

+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
| cluster | count_total_characteristic | count_both_1_and_2 | count_only_in_1 | count_only_in_2 | both_1_and_2 | only_in_1 | only_in_2 |
+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
| A       |                          3 |                  2 |               1 |               0 | [1,2]        | [3]       | []        |
| B       |                          3 |                  2 |               0 |               1 | [2,3]        | []        | [4]       |
+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
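
The counting part of this query can be checked on the toy data with a small Python sketch that uses an in-memory SQLite database. This is only an illustration under assumptions: SQLite replaces Hive, the Hive-specific sort_array(collect_list(...)) columns are left out because SQLite has no direct equivalent, and the subquery aliases u and flags are chosen here for readability.

```python
import sqlite3

# Load the toy tables from the question into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE timeperiod1 (cluster TEXT, characteristic INTEGER);
    CREATE TABLE timeperiod2 (cluster TEXT, characteristic INTEGER);
    INSERT INTO timeperiod1 VALUES ('A',1),('A',2),('A',3),('B',2),('B',3);
    INSERT INTO timeperiod2 VALUES ('A',1),('A',2),('B',2),('B',3),('B',4);
""")

# Inner level: tag each row with its source table and stack both tables.
# Middle level: one row per (cluster, characteristic) with membership flags.
# Outer level: count the flag combinations per cluster.
query = """
    SELECT cluster,
           COUNT(*) AS count_total_characteristic,
           COUNT(CASE WHEN in_1 = 1 AND in_2 = 1 THEN 1 END) AS count_both_1_and_2,
           COUNT(CASE WHEN in_1 = 1 AND in_2 = 0 THEN 1 END) AS count_only_in_1,
           COUNT(CASE WHEN in_1 = 0 AND in_2 = 1 THEN 1 END) AS count_only_in_2
    FROM (SELECT cluster,
                 characteristic,
                 MAX(CASE WHEN tab = 1 THEN 1 ELSE 0 END) AS in_1,
                 MAX(CASE WHEN tab = 2 THEN 1 ELSE 0 END) AS in_2
          FROM (SELECT 1 AS tab, cluster, characteristic FROM timeperiod1
                UNION ALL
                SELECT 2 AS tab, cluster, characteristic FROM timeperiod2) u
          GROUP BY cluster, characteristic) flags
    GROUP BY cluster
    ORDER BY cluster
"""
flag_counts = conn.execute(query).fetchall()
print(flag_counts)
```

The count_only_in_1 column is the set difference the question asks for, and count_only_in_2 gives the difference in the other direction from the same single pass over both tables.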

2017-03-27 20:56:50

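You can use the brickhouse UDFs, which include many functions that perform the kind of operation you describe. More specifically, you can use set_diff, which is explained in the wiki. The README file will guide you through building the jar file.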

You can add the jar file in your query:

ADD jar /PATH/TO/JARFILE/brickhouse-<VERSIONS>-SNAPSHOT.jar

Then use this guide to access the functions: https://github.com/klout/brickhouse/blob/master/src/main/resources/brickhouse.hql

Hope this helps.

2017-04-20 00:54:14 DrV
