How to query 3 large tables for intersecting values using Hive?

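I have 3 very large tables of IP addresses* and am trying to count the number of IPs common to all 3 tables. I have considered using joins and subqueries to find the IP addresses shared by the three tables. How can I find the intersection of all 3 tables with a single query?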

This is incorrect syntax, but it illustrates what I am trying to accomplish:

SELECT COUNT(DISTINCT(a.ip)) FROM a, b, c WHERE a.ip = b.ip = c.ip

I have seen other answers on how to join 3 tables, but nothing for Hive and nothing at this scale.

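*Notes:

Table A: 7 billion rows
Table B: 180 million rows
Table C: 1.68 million rows
The 'tables' are actually Hive metastore tables backed by S3.
Each table contains many duplicate IPs.
Performance suggestions are welcome.
Running this as a Spark SQL query is also an option, if that is a better idea than using Hive.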

2017-07-28 TheProletariat

A straightforward solution:

select  count(*) 

from  (select  1 

      from  (
            select 'a' as tab,ip from a 
         union all select 'b' as tab,ip from b 
         union all select 'c' as tab,ip from c 
         ) t 

      group by ip 

      having  count(case when tab = 'a' then 1 end) > 0 
        and count(case when tab = 'b' then 1 end) > 0 
        and count(case when tab = 'c' then 1 end) > 0 

      ) t
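
As a quick sanity check of the query above, here is a minimal sketch that runs it on toy data. It assumes SQLite (via Python's built-in sqlite3) as a stand-in for Hive, and the IP values are made up for illustration; it says nothing about performance at the real table sizes.

```python
import sqlite3

# Toy stand-ins for tables a, b and c; the IP values are invented for illustration.
rows = {
    "a": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "b": ["10.0.0.1", "10.0.0.3", "10.0.0.3", "10.0.0.4"],
    "c": ["10.0.0.1", "10.0.0.3", "10.0.0.5"],
}

# SQLite is only an assumption here so the example is self-contained; it is not Hive.
conn = sqlite3.connect(":memory:")
for name, ips in rows.items():
    conn.execute(f"create table {name} (ip text)")
    conn.executemany(f"insert into {name} values (?)", [(ip,) for ip in ips])

# Same shape as the query above: tag each row with its table, group by ip,
# and keep only the ips that were seen in all three tables.
common_count = conn.execute("""
    select count(*)
    from (select 1
          from (select 'a' as tab, ip from a
                union all select 'b' as tab, ip from b
                union all select 'c' as tab, ip from c) t
          group by ip
          having count(case when tab = 'a' then 1 end) > 0
             and count(case when tab = 'b' then 1 end) > 0
             and count(case when tab = 'c' then 1 end) > 0) t
""").fetchone()[0]

# Cross-check against a plain Python set intersection.
expected = len(set(rows["a"]) & set(rows["b"]) & set(rows["c"]))
print(common_count, expected)
```

Duplicates inside a table do not inflate the result, because each IP collapses to a single group before the outer count(*).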

This gives you information not only about the intersection of the 3 tables (in_a = 1, in_b = 1, in_c = 1) but also about all the other combinations:

select  in_a 
      ,in_b 
      ,in_c 
      ,count(*) as ips 

from  (select  max(case when tab = 'a' then 1 end) as in_a 
         ,max(case when tab = 'b' then 1 end) as in_b 
         ,max(case when tab = 'c' then 1 end) as in_c 

      from  (
            select 'a' as tab,ip from a 
         union all select 'b' as tab,ip from b 
         union all select 'c' as tab,ip from c 
         ) t 

      group by ip 
      ) t 

group by in_a 
      ,in_b 
      ,in_c
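
To see what the combinations query returns, here is a small sketch on toy data, again assuming SQLite (Python's sqlite3) in place of Hive and made-up IP values. One detail it makes visible: since the case expressions have no else branch, a table that does not contain the IP shows up as NULL (None in Python) rather than 0.

```python
import sqlite3

# Toy stand-ins for tables a, b and c; the IP values are invented for illustration.
rows = {
    "a": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "b": ["10.0.0.1", "10.0.0.3", "10.0.0.3", "10.0.0.4"],
    "c": ["10.0.0.1", "10.0.0.3", "10.0.0.5"],
}

# SQLite is an assumption made only to keep the example runnable on its own.
conn = sqlite3.connect(":memory:")
for name, ips in rows.items():
    conn.execute(f"create table {name} (ip text)")
    conn.executemany(f"insert into {name} values (?)", [(ip,) for ip in ips])

# One row per ip with a membership flag per table, then one row per
# combination of flags with the number of distinct ips in it.
combos = {
    (in_a, in_b, in_c): ips
    for in_a, in_b, in_c, ips in conn.execute("""
        select in_a, in_b, in_c, count(*) as ips
        from (select max(case when tab = 'a' then 1 end) as in_a,
                     max(case when tab = 'b' then 1 end) as in_b,
                     max(case when tab = 'c' then 1 end) as in_c
              from (select 'a' as tab, ip from a
                    union all select 'b' as tab, ip from b
                    union all select 'c' as tab, ip from c) t
              group by ip) t
        group by in_a, in_b, in_c
    """)
}

for flags, ips in combos.items():
    print(flags, ips)
```

The (1, 1, 1) entry is the three-way intersection. Writing `case when tab = 'a' then 1 else 0 end` instead would produce 0 in place of NULL, if that is easier to filter on.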

...and even some more information:

select  sign(cnt_a)     as in_a 
      ,sign(cnt_b)     as in_b 
      ,sign(cnt_c)     as in_c 

      ,count(*)     as unique_ips 
      ,sum(cnt_total)    as total_ips 
      ,sum(cnt_a)     as total_ips_in_a 
      ,sum(cnt_b)     as total_ips_in_b 
      ,sum(cnt_c)     as total_ips_in_c 

from  (select  count(*)        as cnt_total 
         ,count(case when tab = 'a' then 1 end) as cnt_a 
         ,count(case when tab = 'b' then 1 end) as cnt_b 
         ,count(case when tab = 'c' then 1 end) as cnt_c 

      from  (
            select 'a' as tab,ip from a 
         union all select 'b' as tab,ip from b 
         union all select 'c' as tab,ip from c 
         ) t 

      group by ip 
      ) t 

group by sign(cnt_a) 
      ,sign(cnt_b) 
      ,sign(cnt_c)

2017-07-28 18:29:39

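Upvoted because a) the syntax works and b) revenge downvotes are lame. – TheProletariat

@DuduMarkovitz I didn't take it personally - I deleted my answer because there are two better answers and mine didn't add much. I did not downvote you in retaliation. – Siyual

@DuduMarkovitz This works very well, but how do I get all the data that exists in all 3 tables? Can I just select count(*) from it where in_a = 1, in_b = 1 and in_c = 1, or is there a better way? – TheProletariat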

The correct syntax is:

SELECT COUNT(DISTINCT a.ip) 
FROM a JOIN 
    b 
    ON a.ip = b.ip JOIN 
    c 
    ON a.ip = c.ip;

This probably won't finish in our lifetimes. A better approach is:

select ip 
from (select distinct a.ip, 1 as which from a union all 
     select distinct b.ip, 2 as which from b union all 
     select distinct c.ip, 3 as which from c 
    ) abc 
group by ip 
having sum(which) = 6;

Admittedly, sum(which) = 6 just says that all three are present. Because of the select distinct in the subqueries, you can simply do:

having count(*) = 3
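
Here is a minimal sketch comparing the two having conditions on toy data. It assumes SQLite (Python's sqlite3) in place of Hive, made-up IP values, and plain `ip` instead of `a.ip` in the subqueries. The INTERSECT form at the end is not from the answer above; it is the standard SQL set operator, which Spark SQL supports, and whether your Hive version accepts it is something to check.

```python
import sqlite3

# Toy stand-ins for tables a, b and c; the IP values are invented for illustration.
rows = {
    "a": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "b": ["10.0.0.1", "10.0.0.3", "10.0.0.3", "10.0.0.4"],
    "c": ["10.0.0.1", "10.0.0.3", "10.0.0.5"],
}

# SQLite is an assumption made only to keep the example runnable on its own.
conn = sqlite3.connect(":memory:")
for name, ips in rows.items():
    conn.execute(f"create table {name} (ip text)")
    conn.executemany(f"insert into {name} values (?)", [(ip,) for ip in ips])


def common_ips(having):
    """Run the de-duplicate-then-group query with the given having condition."""
    sql = f"""
        select ip
        from (select distinct ip, 1 as which from a union all
              select distinct ip, 2 as which from b union all
              select distinct ip, 3 as which from c
             ) abc
        group by ip
        having {having}
        order by ip
    """
    return [row[0] for row in conn.execute(sql)]


by_sum = common_ips("sum(which) = 6")   # 1 + 2 + 3: only reachable with all three
by_count = common_ips("count(*) = 3")   # one de-duplicated row per table

# Alternative (an assumption, not part of the answer): the INTERSECT set operator.
by_intersect = [row[0] for row in conn.execute("""
    select ip from a
    intersect select ip from b
    intersect select ip from c
    order by ip
""")]

print(by_sum, by_count, by_intersect)
```

Both having conditions rely on the select distinct in each branch; without it, duplicate rows within one table could push count(*) to 3 or sum(which) to 6 for an IP that is missing from another table.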

2017-07-28 18:31:21

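Actual LOL at the first remark. I'll try the second one and get back to you when it finishes in 16 hours or so. – TheProletariat

The Hive query engine doesn't like that syntax... I copied it verbatim (I created 3 tables literally named a, b and c with a field called ip). Any suggestions? Syntax error: org.apache.hadoop.hive.ql.parse.ParseException: line 2:6 cannot recognize input near '(' '(' 'select' in from source – TheProletariat

@Gordon I'm curious: why do you specify 'a.ip', 'b.ip' and 'c.ip' in your subqueries rather than just 'ip'? You are only selecting from one table in each. Is this a "best practice" that I haven't come across? – RToyo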
