2017-03-27 38 views
1

Hive collect_set()

Suppose I have two tables: timeperiod1 and timeperiod2.

timeperiod1 looks like this:

cluster characteristic 
A  1 
A  2 
A  3 
B  2 
B  3 

timeperiod2 looks like this:

cluster characteristic 
A  1 
A  2 
B  2 
B  3 
B  4 

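I want to compute, per cluster, the set difference between the two time periods (i.e. tables). My plan (please let me know of any better approach) is to 1) apply collect_set (I know how to do this) and then 2) compute the set difference (I don't know how to do this).

1) What I did: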

CREATE TABLE collect_char_wk1 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic) AS characteristic
FROM timeperiod1
GROUP BY cluster;

CREATE TABLE collect_char_wk2 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic) AS characteristic
FROM timeperiod2
GROUP BY cluster;

This gives collect_char_wk1:

cluster characteristic 
A  [1,2,3] 
B  [2,3] 

and collect_char_wk2:

cluster characteristic 
A  [1,2] 
B  [2,3,4] 

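2) Is there a Hive function I can use to compute the set difference? I am not familiar enough with Java to write my own set_diff() Hive UDF/UDAF.

The result should be a table, set_diff_wk1_to_wk2, holding the number of characteristics per cluster that appear in week 1 but not in week 2: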

cluster set_diff 
A  1 
B  0 
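
For reference, the intended semantics on the toy data can be sketched in plain Python. This only illustrates the expected output and would not scale to the real data; the names `wk1`, `wk2`, and `set_diff_counts` are made up for this sketch.

```python
# Toy data copied from collect_char_wk1 / collect_char_wk2 above.
wk1 = {"A": {1, 2, 3}, "B": {2, 3}}
wk2 = {"A": {1, 2}, "B": {2, 3, 4}}


def set_diff_counts(first, second):
    """Per cluster, count characteristics in `first` that are missing from `second`."""
    return {
        cluster: len(chars - second.get(cluster, set()))
        for cluster, chars in first.items()
    }


result = set_diff_counts(wk1, wk2)
print(result)  # {'A': 1, 'B': 0}
```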

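The above is a toy example. My actual data is on the scale of tens of billions of rows with multiple columns, so a computationally efficient solution is needed. My data is stored in HDFS, and I am using HiveQL + Python.

Answers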

1

If you are trying to get, for each cluster, the number of characteristics in period1 that are not in period2, you can simply use a left join and group by:

select t1.cluster, count(case when t2.characteristic is null then 1 end) as set_diff 
from timeperiod1 t1 
left join timeperiod2 t2 on t1.cluster=t2.cluster and t1.characteristic=t2.characteristic 
group by t1.cluster 
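
As a quick sanity check, the same query can be run against the toy data in an in-memory SQLite database from Python. This is only a sketch: SQLite stands in for Hive here, so it confirms the logic of the query, not its performance on HDFS-scale data.

```python
import sqlite3

# In-memory database loaded with the toy tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE timeperiod1 (cluster TEXT, characteristic INTEGER);
    CREATE TABLE timeperiod2 (cluster TEXT, characteristic INTEGER);
    INSERT INTO timeperiod1 VALUES ('A',1),('A',2),('A',3),('B',2),('B',3);
    INSERT INTO timeperiod2 VALUES ('A',1),('A',2),('B',2),('B',3),('B',4);
""")

# Rows of timeperiod1 with no match in timeperiod2 have NULLs on the t2 side,
# so counting those NULLs gives the size of the set difference per cluster.
query = """
    SELECT t1.cluster,
           COUNT(CASE WHEN t2.characteristic IS NULL THEN 1 END) AS set_diff
    FROM timeperiod1 t1
    LEFT JOIN timeperiod2 t2
      ON t1.cluster = t2.cluster AND t1.characteristic = t2.characteristic
    GROUP BY t1.cluster
    ORDER BY t1.cluster
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('A', 1), ('B', 0)]
conn.close()
```

Note that this count assumes each (cluster, characteristic) pair appears at most once per table, as in the toy data; duplicated rows in timeperiod1 would each be counted.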

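Out of curiosity, is this faster than using collect_set()? It seems the LEFT JOIN would take a long time, whereas the collect_set() approach can significantly reduce the number of rows. I added a note above explaining that I am working with billions of rows (about 30 billion), so minimizing query time is ideal. – user2205916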


@user2205916 Try it on your data and check the run time. It is hard to say which approach will be faster. –

1
select  cluster 

      ,count(*)           as count_total_characteristic 
      ,count(case when in_1 = 1 and in_2 = 1 then 1 end) as count_both_1_and_2 
      ,count(case when in_1 = 1 and in_2 = 0 then 1 end) as count_only_in_1 
      ,count(case when in_1 = 0 and in_2 = 1 then 1 end) as count_only_in_2 

      ,sort_array(collect_list(case when in_1 = 1 and in_2 = 1 then characteristic end)) as both_1_and_2 
      ,sort_array(collect_list(case when in_1 = 1 and in_2 = 0 then characteristic end)) as only_in_1 
      ,sort_array(collect_list(case when in_1 = 0 and in_2 = 1 then characteristic end)) as only_in_2 

from  (select  cluster 
         ,characteristic 
         ,max(case when tab = 1 then 1 else 0 end) as in_1 
         ,max(case when tab = 2 then 1 else 0 end) as in_2 

      from  (   select 1 as tab,cluster,characteristic from timeperiod1 
         union all select 2 as tab,cluster,characteristic from timeperiod2 
         ) t 

      group by cluster 
         ,characteristic 
      ) t 

group by cluster 

order by cluster 
; 

+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+ 
| cluster | count_total_characteristic | count_both_1_and_2 | count_only_in_1 | count_only_in_2 | both_1_and_2 | only_in_1 | only_in_2 | 
+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+ 
| A       |                          3 |                  2 |               1 |               0 | [1,2]        | [3]       | []        |
| B       |                          3 |                  2 |               0 |               1 | [2,3]        | []        | [4]       |
+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
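
The logic of this query (tag each row with its source table, collapse to one row per cluster and characteristic, then aggregate per cluster) can be mirrored in plain Python on the toy data. This is a rough sketch for checking the idea, not a replacement for the Hive query; the variable names `seen` and `summary` are made up here.

```python
from collections import defaultdict

timeperiod1 = [("A", 1), ("A", 2), ("A", 3), ("B", 2), ("B", 3)]
timeperiod2 = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("B", 4)]

# Inner query: UNION ALL with a source tag, then one entry per
# (cluster, characteristic) recording which tables it appeared in.
seen = defaultdict(set)
for tab, rows in ((1, timeperiod1), (2, timeperiod2)):
    for cluster, characteristic in rows:
        seen[(cluster, characteristic)].add(tab)

# Outer query: group by cluster and split characteristics by membership.
summary = defaultdict(
    lambda: {"both_1_and_2": [], "only_in_1": [], "only_in_2": []}
)
for (cluster, characteristic), tabs in sorted(seen.items()):
    if tabs == {1, 2}:
        key = "both_1_and_2"
    elif tabs == {1}:
        key = "only_in_1"
    else:
        key = "only_in_2"
    summary[cluster][key].append(characteristic)

for cluster in sorted(summary):
    print(cluster, summary[cluster])
```

The lengths of the three lists correspond to the count_both_1_and_2, count_only_in_1, and count_only_in_2 columns in the table above.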