2017-07-28 54 views
0

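How do I query 3 large tables for intersecting values using Hive?

I have 3 very large tables of IP addresses* and am trying to count the number of IPs common to all 3 tables. I have considered using joins and subqueries to find the IPs shared by the three tables. How can I find the intersection of all 3 tables with one query?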

This is incorrect syntax, but it illustrates what I am trying to accomplish:

SELECT COUNT(DISTINCT(a.ip)) FROM a, b, c WHERE a.ip = b.ip = c.ip 

I have seen other answers about how to join 3 tables, but nothing for Hive and nothing at this scale.

*Notes:

  • Table A: 7 billion rows
  • Table B: 180 million rows
  • Table C: 1.68 million rows
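  • The 'tables' are actually Hive metastore tables backed by S3.
  • Each table contains many duplicate IPs.
  • Performance suggestions are welcome.
  • I can also run Spark SQL queries, if using that instead of Hive is a better idea.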

Answers

1

A straightforward solution:

select  count(*) 

from  (select  1 

      from  (
            select 'a' as tab,ip from a 
         union all select 'b' as tab,ip from b 
         union all select 'c' as tab,ip from c 
         ) t 

      group by ip 

      having  count(case when tab = 'a' then 1 end) > 0 
        and count(case when tab = 'b' then 1 end) > 0 
        and count(case when tab = 'c' then 1 end) > 0 

      ) t 
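To sanity-check the logic before running it against billions of rows, the same query can be tried on toy data. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption, not the asker's environment), and the sample IPs are made up.

```python
import sqlite3

# Made-up sample data; duplicates are deliberate, as in the real tables.
rows = {
    "a": ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3"],
    "b": ["1.1.1.1", "2.2.2.2", "2.2.2.2", "4.4.4.4"],
    "c": ["1.1.1.1", "3.3.3.3", "2.2.2.2", "2.2.2.2"],
}

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
for name, ips in rows.items():
    conn.execute(f"create table {name} (ip text)")
    conn.executemany(f"insert into {name} values (?)", [(ip,) for ip in ips])

# Tag every row with its source table, group by IP, and keep only the
# IPs that were seen with all three tags.
query = """
select count(*)
from (select 1
      from (select 'a' as tab, ip from a
            union all select 'b' as tab, ip from b
            union all select 'c' as tab, ip from c) t
      group by ip
      having count(case when tab = 'a' then 1 end) > 0
         and count(case when tab = 'b' then 1 end) > 0
         and count(case when tab = 'c' then 1 end) > 0) t
"""
common_ip_count = conn.execute(query).fetchone()[0]
print(common_ip_count)
```

Only 1.1.1.1 and 2.2.2.2 occur in all three toy tables, so duplicates within a table do not inflate the count.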

The following query tells you not only about the intersection of all 3 tables (in_a = 1, in_b = 1, in_c = 1), but also about every other combination:

select  in_a 
      ,in_b 
      ,in_c 
      ,count(*) as ips 

from  (select  max(case when tab = 'a' then 1 end) as in_a 
         ,max(case when tab = 'b' then 1 end) as in_b 
         ,max(case when tab = 'c' then 1 end) as in_c 

      from  (
            select 'a' as tab,ip from a 
         union all select 'b' as tab,ip from b 
         union all select 'c' as tab,ip from c 
         ) t 

      group by ip 
      ) t 

group by in_a 
      ,in_b 
      ,in_c 
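A quick way to see what this query returns is to run it on toy data. This is a hedged sketch using Python's sqlite3 module as a stand-in for Hive (an assumption) with made-up IPs. Note that a `case` without an `else` yields NULL, so a table that does not contain the IP shows up as NULL (None in Python), not 0.

```python
import sqlite3

# Made-up sample data: 3.3.3.3 is missing from b, 4.4.4.4 is only in b.
rows = {
    "a": ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3"],
    "b": ["1.1.1.1", "2.2.2.2", "2.2.2.2", "4.4.4.4"],
    "c": ["1.1.1.1", "3.3.3.3", "2.2.2.2", "2.2.2.2"],
}

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
for name, ips in rows.items():
    conn.execute(f"create table {name} (ip text)")
    conn.executemany(f"insert into {name} values (?)", [(ip,) for ip in ips])

query = """
select in_a, in_b, in_c, count(*) as ips
from (select max(case when tab = 'a' then 1 end) as in_a,
             max(case when tab = 'b' then 1 end) as in_b,
             max(case when tab = 'c' then 1 end) as in_c
      from (select 'a' as tab, ip from a
            union all select 'b' as tab, ip from b
            union all select 'c' as tab, ip from c) t
      group by ip) t
group by in_a, in_b, in_c
"""
# Map each (in_a, in_b, in_c) combination to its number of distinct IPs.
combos = {row[:3]: row[3] for row in conn.execute(query)}
for combo, ips in combos.items():
    print(combo, ips)
```

The (1, 1, 1) entry is the three-way intersection; filtering the outer query on in_a = 1 and in_b = 1 and in_c = 1 would return just that row.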

...and even some more information:

select  sign(cnt_a)     as in_a 
      ,sign(cnt_b)     as in_b 
      ,sign(cnt_c)     as in_c 

      ,count(*)     as unique_ips 
      ,sum(cnt_total)    as total_ips 
      ,sum(cnt_a)     as total_ips_in_a 
      ,sum(cnt_b)     as total_ips_in_b 
      ,sum(cnt_c)     as total_ips_in_c 

from  (select  count(*)        as cnt_total 
         ,count(case when tab = 'a' then 1 end) as cnt_a 
         ,count(case when tab = 'b' then 1 end) as cnt_b 
         ,count(case when tab = 'c' then 1 end) as cnt_c 

      from  (
            select 'a' as tab,ip from a 
         union all select 'b' as tab,ip from b 
         union all select 'c' as tab,ip from c 
         ) t 

      group by ip 
      ) t 

group by sign(cnt_a) 
      ,sign(cnt_b) 
      ,sign(cnt_c) 
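The difference between unique_ips and total_ips is easiest to see on toy data. The sketch below again assumes Python's sqlite3 module as a stand-in for Hive, with made-up IPs, and shows only two of the aggregate columns. Because `sign()` may be missing from some SQLite builds, a Python version is registered explicitly; that registration is part of the stand-in, not of the Hive query.

```python
import sqlite3

# Made-up sample data with duplicates inside each table.
rows = {
    "a": ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3"],
    "b": ["1.1.1.1", "2.2.2.2", "2.2.2.2", "4.4.4.4"],
    "c": ["1.1.1.1", "3.3.3.3", "2.2.2.2", "2.2.2.2"],
}

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
# Register sign() ourselves so the query runs on any SQLite build.
conn.create_function("sign", 1, lambda x: (x > 0) - (x < 0))
for name, ips in rows.items():
    conn.execute(f"create table {name} (ip text)")
    conn.executemany(f"insert into {name} values (?)", [(ip,) for ip in ips])

query = """
select sign(cnt_a) as in_a, sign(cnt_b) as in_b, sign(cnt_c) as in_c,
       count(*) as unique_ips, sum(cnt_total) as total_ips
from (select count(*) as cnt_total,
             count(case when tab = 'a' then 1 end) as cnt_a,
             count(case when tab = 'b' then 1 end) as cnt_b,
             count(case when tab = 'c' then 1 end) as cnt_c
      from (select 'a' as tab, ip from a
            union all select 'b' as tab, ip from b
            union all select 'c' as tab, ip from c) t
      group by ip) t
group by sign(cnt_a), sign(cnt_b), sign(cnt_c)
"""
# Map (in_a, in_b, in_c) to (unique_ips, total_ips).
summary = {row[:3]: row[3:] for row in conn.execute(query)}
for combo, counts in summary.items():
    print(combo, counts)
```

Unlike the max(case ...) variant, sign(count(...)) reports an absent table as 0 rather than NULL.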
+0

Upvoted, because a) the syntax works and b) revenge downvotes are lame. – TheProletariat

+1

@DuduMarkovitz I didn't take it personally - I deleted my answer because there were two better answers and mine didn't add much. I did not downvote you in return. – Siyual

+0

@DuduMarkovitz This works great, but how do I get all the data that exists in all 3 tables? Can I select a count(*) from it where in_a = 1, in_b = 1 and in_c = 1, or is there a better way? – TheProletariat

3

The correct syntax is:

SELECT COUNT(DISTINCT a.ip) 
FROM a JOIN 
    b 
    ON a.ip = b.ip JOIN 
    c 
    ON a.ip = c.ip; 
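A note on the cost of this join: since each table holds many duplicate IPs, the join logically produces one row for every matching combination of rows before DISTINCT collapses them. The toy sketch below illustrates that multiplication; it assumes Python's sqlite3 module as a stand-in for Hive and a single made-up IP, and it says nothing about how Hive would actually plan the query.

```python
import sqlite3

# One made-up IP, repeated a different number of times in each table.
copies = {"a": 3, "b": 2, "c": 4}

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
for name, n in copies.items():
    conn.execute(f"create table {name} (ip text)")
    conn.executemany(f"insert into {name} values (?)", [("10.0.0.1",)] * n)

join_sql = "from a join b on a.ip = b.ip join c on a.ip = c.ip"

# Rows the join produces before any de-duplication: 3 * 2 * 4.
joined_rows = conn.execute(f"select count(*) {join_sql}").fetchone()[0]
# What the query finally reports after DISTINCT.
distinct_ips = conn.execute(
    f"select count(distinct a.ip) {join_sql}"
).fetchone()[0]

print(joined_rows, distinct_ips)
```

One IP with 3, 2 and 4 copies already yields 24 joined rows for a final answer of 1, which is why de-duplicating each table first is attractive.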

This probably won't finish in our lifetimes. A better approach is:

select ip 
from (select distinct a.ip, 1 as which from a union all 
     select distinct b.ip, 2 as which from b union all 
     select distinct c.ip, 3 as which from c 
    ) abc 
group by ip 
having sum(which) = 6; 

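Admittedly, sum(which) = 6 is just a way of saying that all three are present: the tags are 1, 2 and 3, and because of the select distinct each table contributes at most one row per IP, so the sum reaches 6 only when all three tables contain the IP. For the same reason, you can simply do: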

having count(*) = 3 
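As a sanity check that the two HAVING conditions agree, both can be run on toy data. This is a sketch only, assuming Python's sqlite3 module as a stand-in for Hive and made-up IPs; the `as ip` alias is added for the stand-in and is not in the query above.

```python
import sqlite3

# Made-up sample data; duplicates are deliberate.
rows = {
    "a": ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3"],
    "b": ["1.1.1.1", "2.2.2.2", "2.2.2.2", "4.4.4.4"],
    "c": ["1.1.1.1", "3.3.3.3", "2.2.2.2", "2.2.2.2"],
}

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
for name, ips in rows.items():
    conn.execute(f"create table {name} (ip text)")
    conn.executemany(f"insert into {name} values (?)", [(ip,) for ip in ips])

# De-duplicate each table first, then keep IPs seen under all three tags.
query_template = """
select ip
from (select distinct a.ip as ip, 1 as which from a union all
      select distinct b.ip, 2 as which from b union all
      select distinct c.ip, 3 as which from c) abc
group by ip
having {condition}
order by ip
"""

def common_ips(condition):
    """Run the query with the given HAVING condition and return the IPs."""
    sql = query_template.format(condition=condition)
    return [row[0] for row in conn.execute(sql)]

by_sum = common_ips("sum(which) = 6")
by_count = common_ips("count(*) = 3")
print(by_sum, by_count)
```

Both conditions rely on the select distinct in each branch; without it, repeated rows from one table could reach a sum of 6 or a count of 3 on their own.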
+0

Actual lol at the first comment. I'll try the second one and get back to you when it finishes in 16 hours or so. – TheProletariat

+0

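The Hive query engine doesn't like that syntax... I copied it verbatim (I created 3 tables literally named a, b and c, with a field called IP). Any suggestions? Syntax error: org.apache.hadoop.hive.ql.parse.ParseException: line 2:6 cannot recognize input near '(' '(' 'select' in from source – TheProletariat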

+0

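@Gordon I'm curious: why did you specify 'a.ip', 'b.ip' and 'c.ip' in your subqueries, rather than just 'ip'? You are only selecting from one table in each. Is this a "best practice" that I haven't come across? – RToyo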